Package license definition



  • ProGet Version 2022.16 (Build 7) (Inedo Hub) - Trial version

    Hello,

    we are evaluating ProGet as our package manager, and license control is one of the critical features that ProGet gives us.

    In the first installation, I remember seeing an option to specify the license of a package when ProGet was not able to detect it.

    Now that I have reinstalled everything using a trial version, I don't see any way to specify the license of a package when ProGet is not able to detect it. Even when I go to the metadata of the package, there is no option to specify the license.

    Am I missing some configuration?

    Best Regards,
    Pedro Magno


  • inedo-engineer

    Hi Pedro,

    There are multiple ways that an author can specify a license on a NuGet package:
    https://blog.inedo.com/nuget/nuget-license-expressions

    Or, a package author can specify no license at all. If the author chooses "file" as the license type, then ProGet will only be able to "see" this license if the package is in ProGet - either as a Cached or Local package.

    For example, the SmartInspect package has a "file" type of license agreement:

    79f25c48-e2c4-4d72-babb-eeb503cee88e-image.png

    So in this case, you want to read the "embedded license file", then assign a license agreement code to it.

    Note that if a package file has not been downloaded yet, it will appear to ProGet as having no license at all. This is a NuGet API limitation.

    53a8a451-3e6b-4b57-85bf-a4f14482f8bc-image.png

    Cheers,
    Alana



  • Hello Alana,

    Thank you for the clarification.

    The problem is that I don't see the option to assign a license type to a package, as you can see in the two examples below.

    First Example
    Accord Package
    This package has a file license as you can see here in the picture below
    5866ef36-2eac-4cb3-867d-b378e37c1ca4-image.png

    Because ProGet is not able to read the file to identify the license (LGPL), the license doesn't appear in the main feed and cannot be allowed/restricted by our rules:
    cda6e72b-2637-4157-8e49-6dd1c7b6c756-image.png

    The same is true for some other packages, for example, Newtonsoft.Json.Schema.

    So, I would like to be able to manually specify the licenses for these types of packages.

    Second Example
    For a package without any license information, the option to assign the license type to Package is not appearing at all

    56980e94-ef98-4bab-af10-6ac09a5b8980-image.png

    e89dbb7a-ecd5-49c8-b136-2c2eb75df26e-image.png

    Best Regards,
    Pedro


  • inedo-engineer

    Hi @pmsensi ,

    Oh I see! In this case, I think your "Feed Usage" setting is currently set to "Private". You should set this to "Public" packages; then the licensing will be displayed/configurable.

    Alana



  • Hello @atripp ,

    yes, that was the reason :)

    Is it possible to filter for all packages without a license defined?

    Best Regards,
    Pedro


  • inedo-engineer

    Hi @pmsensi ,

    Yes, you can set this up as a licensing rule to block packages with unknown licenses.

    Reporting & SCA > Licenses > manage license types > Manage Default License Rules

    :)

    Cheers,
    Alana



  • Hello @atripp,

    yes, I know that I can do that, and it is really an important feature for our evaluation :)

    But what I would like to have is a list of packages without licenses, so that I can go through all of them and assign the licenses manually.

    The only way I see to do that is to go through all available packages in the feed and check whether the license info is available or not.

    So it would be nice to have a list somehow. Is that possible? Can I get that information from the database?

    Best Regards,
    Pedro


  • inedo-engineer

    Hi @pmsensi ,

    That is the general process you'll want to use... basically just assign the licenses as you need them. In general, as part of a package approval workflow:
    https://blog.inedo.com/nuget/package-approval-workflow

    There are over 300k packages on nuget.org (5M+ versions), and growing. So it's many packages. ProGet does not download a list of these packages, but displays live data from the NuGet.org feed.

    hopefully that helps :)

    Cheers,
    Alana



  • Hi @pmsensi,

    we are facing a similar challenge. Basically, what I am interested in is a list of packages that we are currently using in our products that don't have a known license assigned to them (or where ProGet was unable to identify the license, e.g. because the license info just states "see license.txt" or something like that). I think that you are looking for something similar and are not asking to analyze all packages that exist on NuGet or npm, because that would probably not be feasible (as @atripp already mentioned in her reply).

    We are using the new Reporting & SCA feature extensively, and we are getting some very nice license statistics out of it. However, the one thing that is currently missing is the number or list of packages with an unknown license. I hope that this will be added in a future release. In the meantime, looking at the database can be a workaround to at least get a list of the package names.

    I think starting with 2022, ProGet started storing license infos of known packages in the database. I am assuming that that info is stored for packages that have been downloaded via ProGet (i.e. "cached") or maybe also for packages that have been reported to ProGet via pgscan (@atripp please feel free to add some info to this).

    One simple approach to get the license info of all "known" packages, would be this:

    select 
    	pi.Package_Name, l.External_Id, l.Title_Text
    from 
    	PackageIds pi
    left join 
    	PackageLicenses pl on pl.Package_Id = pi.Package_Id
    left join
    	Licenses l on pl.License_Id = l.License_Id
    group by pi.Package_Name, l.External_Id, l.Title_Text
    order by 1;
    

    Note that this select groups by package names and ignores versions. You could now simply look for packages without a known license like this:

    select 
    	pi.Package_Name, l.External_Id, l.Title_Text
    from 
    	PackageIds pi
    left join 
    	PackageLicenses pl on pl.Package_Id = pi.Package_Id
    left join
    	Licenses l on pl.License_Id = l.License_Id
    where l.External_Id is null
    group by pi.Package_Name, l.External_Id, l.Title_Text
    order by 1;
    

    However, if you look at the result of the first query, you might notice that there are multiple entries for some packages. That might be because a package has changed its license or just its license info (maybe switching from a "see license.txt" to an actual SPDX tag), or maybe because there was a bug in previous versions of ProGet that has been fixed in 2022.18 (https://inedo.myjetbrains.com/youtrack/issue/PG-2263). So you might get some false positives with this approach.

    To get a list of packages without any entry of a known license, we have to eliminate the ones with multiple entries. There might be a better and more readable way to get this done, but the query that worked for us is this one:

    -- Packages whose only license entry is "no known license"
    select
        distinct pi.Package_Name
    from
        PackageIds pi
    left join
        PackageLicenses pl on pl.Package_Id = pi.Package_Id
    left join
        Licenses l on pl.License_Id = l.License_Id
    where l.External_Id is null
    and pi.Package_Name in
    (
        -- b: package names that have exactly one distinct license entry
        select Package_Name
        from
        (
            select a.Package_Name
            from
            (
                -- a: one row per package name and license (NULL if none)
                select pi.Package_Name, l.External_Id, l.Title_Text
                from PackageIds pi
                left join PackageLicenses pl on pl.Package_Id = pi.Package_Id
                left join Licenses l on pl.License_Id = l.License_Id
                group by pi.Package_Name, l.External_Id, l.Title_Text
            ) a
            group by a.Package_Name
            having count(*) = 1
        ) b
    );
    

    BTW: in case you are curious: that select gives us a list of 1222 packages.
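
    A possibly more readable variant of that last query is sketched below. It runs against a tiny in-memory SQLite stand-in so that it can be tried anywhere. The table and column names are copied from the queries above; the sample rows, the SQLite setup, and the `having count(...) = 0` formulation are assumptions and have not been checked against a real ProGet database.

```python
import sqlite3

# Hypothetical miniature of the three tables used in the queries above.
SCHEMA = """
create table PackageIds (Package_Id integer primary key, Package_Name text);
create table Licenses (License_Id integer primary key, External_Id text, Title_Text text);
create table PackageLicenses (Package_Id integer, License_Id integer);
"""

# count(l.External_Id) ignores NULLs, so a package name is returned only
# when none of its joined rows carries a known license.
UNLICENSED_SQL = """
select pi.Package_Name
from PackageIds pi
left join PackageLicenses pl on pl.Package_Id = pi.Package_Id
left join Licenses l on pl.License_Id = l.License_Id
group by pi.Package_Name
having count(l.External_Id) = 0
order by 1;
"""


def packages_without_license(conn):
    """Return the names of packages that have no known license entry."""
    return [row[0] for row in conn.execute(UNLICENSED_SQL)]


def build_sample_db():
    """Create made-up sample data; none of it comes from a real feed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("insert into PackageIds values (?, ?)", [
        (1, "Accord"),            # no license row at all
        (2, "Newtonsoft.Json"),   # one known license
        (3, "Some.Package"),      # one entry without a license ...
        (4, "Some.Package"),      # ... and one with: must not be reported
    ])
    conn.execute("insert into Licenses values (1, 'MIT', 'MIT License')")
    conn.executemany("insert into PackageLicenses values (?, ?)", [(2, 1), (4, 1)])
    return conn


if __name__ == "__main__":
    print(packages_without_license(build_sample_db()))
```

    One difference to keep in mind: if a package had several license rows that all lack an External_Id but differ in Title_Text, the nested query would skip it, while this variant would report it.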



  • Hello @sebastian ,

    yes, that is my current problem too! One of the main reasons to buy the ProGet is the nice Reporting & SCA feature! I hope that functionality will be added in the future too!

    Thank you for the clear explanation and sample queries, I will use them for sure!

    1222 packages are a lot! For now, we do not have many packages, but we will have them for sure :)

    @atripp FYI, this is what I was looking for :)


  • inedo-engineer

    Thanks much for providing that query @sebastian !

    And yes -- the data is "halfway there" in ProGet 2022 (maybe 20% there?), but "Packages" (which are spread across several different tables like NuGetPackages, NpmPackages, etc.) aren't tied to licenses just yet.

    But with the database refactoring we are planning in 2023, it's going to be a lot easier to get and display this information, especially for SCA-related things.

    There are also going to be vulnerability improvements - please stay tuned :)



  • @pmsensi said in Package license definition:

    1222 packages are a lot! For now, we do not have many packages, but we will have them for sure :)

    Yeah, we have been using ProGet for a while now... :-)

    I'm pretty sure some are false positives, though. It seems that when a package has been downloaded, but there is no usage of it in any product (happens for infrastructure packages like angular/core, angular/cli or some Visual Studio extensions), there is no license entry in the database even though the package has a perfectly fine SPDX tag.

    What's driving me nuts at the moment is that Microsoft seems to be using embedded license info files for quite a number of their packages. Assigning licenses to those is going to take a while...


  • inedo-engineer

    @sebastian said in Package license definition:

    What's driving me nuts at the moment is that Microsoft seems to be using embedded license info files for quite a number of their packages. Assigning licenses to those is going to take a while...

    Oh really? They even have their own accepted SPDX code... whyyy Microsoft 🤦

    In the past, we thought of adding a kind of wildcard URL for licenses, like a "package://Microsoft.*" => "MSPL" would basically associate all packages with that prefix that don't otherwise have a SPDX code, or an explicit license.

    Wonder if that would help here?



  • @apxltd

    I don't know if using the package name will be enough... There are a lot of packages from Microsoft that don't start with Microsoft.

    It would be nice to be able to filter the licenses based on other metadata, for example, owners.

    At the moment, in our company, we use TrustedSigners to allow/block the installation of some packages from external sources, as you can see below. So maybe being able to assign licenses by owner would be a big win for us here. I don't know @sebastian's thoughts about this approach.

      <trustedSigners>
        <repository name="nuget.org" serviceIndex="https://api.nuget.org/v3/index.json">
          <certificate fingerprint="0e5f38f57dc1bcc806d8494f4f90fbcedd988b46760709cbeec6f4219aa6157d" hashAlgorithm="SHA256" allowUntrustedRoot="false" />
          <certificate fingerprint="5a2901d6ada3d18260b9c6dfe2133c95d74b9eef6ae0e5dc334c8454d1477df4" hashAlgorithm="SHA256" allowUntrustedRoot="false" />
          <owners>Microsoft;dotnetfoundation;aspnet;Microsoft Corporation;confluent</owners>
        </repository>
      </trustedSigners>
    


  • @apxltd Using URLs for packages would be a nice feature, especially as package URLs contain version numbers. Getting rid of those might already be helpful, because some packages come in a rather large number of versions. Of course, even applying a wildcard to just the version of a package might lead to wrong results, because in theory a package could change its license from one version to the next, but that is probably not a very realistic scenario.

    However, things can become messy when different wildcard URLs could be applied to the same package. Unfortunately, not all Microsoft packages use MSPL. Some use MIT, some use the proprietary licenses... It's a real mess. So it probably wouldn't be as easy as making just one rule for microsoft.*. But still, using wildcards could make things a bit easier.
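
    To make the overlap problem concrete, here is a minimal sketch of wildcard-based license rules. Nothing in it is a ProGet feature: the rule list, the `matching_licenses` helper, and the example patterns and license codes are all hypothetical, and Python's `fnmatch` module is only a stand-in for whatever matching ProGet might implement.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (package-name pattern, license code). Not real ProGet config.
RULES = [
    ("microsoft.*", "MSPL"),
    ("microsoft.extensions.*", "MIT"),
]


def matching_licenses(package_name, rules=RULES):
    """Return every license code whose pattern matches the package name."""
    name = package_name.lower()  # compare case-insensitively, as NuGet does for IDs
    return [code for pattern, code in rules if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    for pkg in ("Microsoft.Extensions.Logging", "Microsoft.CSharp", "Newtonsoft.Json"):
        print(pkg, matching_licenses(pkg))
```

    The first package matches both rules, so some precedence rule (most specific pattern wins, first match wins, or "flag as a conflict") would have to be defined before wildcards could be trusted.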

    @pmsensi We haven't used TrustedSigners yet. The approach is interesting, and yes: it would probably make sense to have a central service like a ProGet server check package owners, but it would be a completely different approach to a very different problem. I don't think applying licenses to package owners makes a lot of sense, because - as written above - package owners like Microsoft can apply different licenses to different packages.

    That being said, having a new entity "package owner" or "publisher" or something like that and being able to filter for that entity could be a cool new feature. This could also be used in the SBOM reporting feature (like: 40 packages come from Microsoft, 20 from vendor A, 7 from vendor B).


  • inedo-engineer

    @sebastian @pmsensi

    Thanks for the added insights! This doesn't seem as simple as I had hoped...

    The current solution in ProGet now (i.e. packageid:// and package:// URLs that can be associated with license codes) feels hacky, but it seems we don't have many options.

    Well, one thing that might work.... submitting a pull request to those packages to add the MIT license code to their project file? It's probably just an oversight on their part...


  • inedo-engineer

    @sebastian said in Package license definition:

    That being said, having a new entity "package owner" or "publisher" or something like that and being able to filter for that entity could be a cool new feature. This could also be used in the SBOM reporting feature (like: 40 packages come from Microsoft, 20 from vendor A, 7 from vendor B).

    We considered doing "something" with this metadata ages ago, but found that a lot of packages (including npm, etc.) have multiple owners/authors. On the top page of nuget.org, just 3/20 seem to have a single author.

    In general, the human-driven "package approval workflow" seems to be the best bet. Maybe painful at first, but "not too bad" in the long run.



  • @apxltd said in Package license definition:

    In the past, we thought of adding a kind of wildcard URL for licenses, like a "package://Microsoft.*" => "MSPL" would basically associate all packages with that prefix that don't otherwise have a SPDX code, or an explicit license.

    Wonder if that would help here?

    A colleague of mine actually had an interesting idea today: How about calculating hash values (like SHA-1) for embedded license files and assigning licenses to those hash values? That way one would only have to assign licenses to each license text once (if the license text is identical across different packages or package versions).

    The workflow would be similar to assigning licenses to actual packages / versions, but instead of adding a pseudo URL like "package://SomeVendor.SomePackage/1.2.0" we could do something like "hash://0xA1B2C3", where 0xA1B2C3 would be the hash value of the content of the license file. All other packages / versions with the exact same license text would automatically be mapped to the same license.

    Of course, some license texts include the name of the product or a copyright note, so we would still get multiple entries for the same license. But it should be significantly less than adding one entry for every package / version.
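
    A rough sketch of the idea, just to show how little is needed once the package file is available locally. Everything here is an assumption for illustration: the function names, the `known_hashes` mapping, the line-ending normalization, and the use of SHA-256 rather than SHA-1. It only relies on package files being ZIP archives.

```python
import hashlib
import io
import zipfile


def license_hash(package_bytes, license_path):
    """Return the SHA-256 hex digest of one file inside a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        text = archive.read(license_path)
    # Normalize line endings so CRLF/LF copies of the same text share a hash.
    normalized = text.replace(b"\r\n", b"\n").strip()
    return hashlib.sha256(normalized).hexdigest()


def resolve_license(package_bytes, license_path, known_hashes):
    """Return the assigned license code, or 'unknown' if the text is new."""
    return known_hashes.get(license_hash(package_bytes, license_path), "unknown")


def make_package(license_text):
    """Build a tiny in-memory package for demonstration purposes only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("LICENSE.txt", license_text)
    return buffer.getvalue()


if __name__ == "__main__":
    v1 = make_package("Example license, revision 1\n")
    v2 = make_package("Example license, revision 2\n")
    # One manual assignment covers every package that embeds the same text.
    known = {license_hash(v1, "LICENSE.txt"): "Example-1.0"}
    print(resolve_license(v1, "LICENSE.txt", known))
    print(resolve_license(v2, "LICENSE.txt", known))
```

    In this sketch a changed license text simply falls back to "unknown", which is the point where a manual review would be triggered.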

    What do you guys think about this idea?


  • inedo-engineer

    Hi @sebastian

    Thanks for the idea -- yes, I think it's an interesting approach!

    We explored it a while ago, and this was where we ended up...

    1. It's even more confusing to use than packageid://, so we'd need to find a better UI solution

    2. We'd want to store the full license text as well, so it'd be easy to confirm the contents

    3. This is all a nontrivial engineering effort

    4. We're not sure how many packages this would impact and how much value / time savings it would represent

    5. None of this would even work for remote packages, which is by far what most users find confusing and have issues with

    6. It would probably require less engineering effort to scan/query all packages on NuGet and make a "database" of package licenses using a little human intelligence

    7. It would require even less effort to just ask package authors to specify license codes, and then eventually the problem will go away on its own probably

    And then we gave up because there were more priority things to address ;)

    Cheers,
    Alex



  • Hi @apxltd,

    1. It's even more confusing to use than packageid://, so we'd need to find a better UI solution

    Would it be more confusing to have two or three hash values instead of dozens, maybe hundreds of packageid:// entries for a given license? People are getting used to using hash values (e.g. with Git commits). One would have to be able to view the original license text, of course (see next point)...

    2. We'd want to store the full license text as well, so it'd be easy to confirm the contents

    Agreed. Otherwise, there is no way of confirming that the hash value was assigned to the correct license. But I don't think that would be infeasible. One could either store the content of the license file in a dedicated table in the database, or in a file. And we only have to store it once per hash value. A single license file takes what, 1 maybe 2 KB? Let's say we will have a dozen, maybe a hundred different license files at the end (unlikely, probably more in the lower double digits area). That would take less than 1 MB in total.

    3. This is all a nontrivial engineering effort

    Agreed, but it's not too complex either. There is obviously already code in ProGet which detects that there is an embedded license file and that can display the content of that file. I'd say you are probably almost halfway there :-)

    4. We're not sure how many packages this would impact and how much value / time savings it would represent

    Here is an example: Consider the Google.Apis.* packages (https://www.nuget.org/packages?q=google.apis). They all have the exact same Apache 2.0 license file embedded, so we would have to assign just a single hash value instead of dozens / hundreds of individual packages.
    Another point is updates of packages. At the moment we would have to assign a license to every new version of a package. I think this feature could be a huge time saver.

    5. None of this would even work for remote packages, which is by far what most users find confusing and have issues with

    Yes, for this to work, ProGet would have to download the given package and read the content of its license file. However, there are two major parts of ProGet where licenses become relevant, and I believe it's not unrealistic that ProGet has already downloaded the package in question in both of them:

    1. Blocking packages. We want to prevent users from downloading packages with specific licenses. Let's assume a user wants to download X with version 1.2.3, and that package has an embedded license file. Now, to be able to serve the package to the user, ProGet has to download it first, right? Either it has done so already and the package is cached, or ProGet has to download it on demand. In both cases ProGet has a chance to read the license file and compute a hash value for its content, if it hasn't already done that before.
    2. Reporting. Most packages that are analyzed by the SCA feature should have been downloaded in the past via ProGet.

    6. It would probably require less engineering effort to scan/query all packages on NuGet and make a "database" of package licenses using a little human intelligence

    I think downloading all packages (including all versions of each package) and analyzing them would take a lot of effort and resources.

    7. It would require even less effort to just ask package authors to specify license codes, and then eventually the problem will go away on its own probably

    Will try this, starting with the Google folks...



  • @sebastian said in Package license definition:

    7. It would require even less effort to just ask package authors to specify license codes, and then eventually the problem will go away on its own probably

    Will try this, starting with the Google folks...

    One more thought on the last point: What about packages that do not use a supported open source license? Think of Oracle.ManagedDataAccess. There simply is no SPDX tag that matches their proprietary license.

    That package is actually a good example for a scenario where hash-based license assignments would make a lot of sense. They update their license every now and then. Last update was from version 21.7.0 to 21.8.0. A hash-based license check would have caught that and labeled the new license as "unknown". As a result, version 21.8.0 would have been blocked until someone reviews the new license and manually approves it.


  • inedo-engineer

    Thanks for the additional thoughts @sebastian!

    I agree... it's not totally infeasible from a technical standpoint, but it's still pretty tricky. Just to comment on a technical thing, FYI...

    Now, to be able to serve the package to the user, ProGet has to download it first, right?

    Actually, ProGet "streams" content from connectors. This means that, when a user requests a package from ProGet (and that package is on a remote connector), ProGet will then request the package from the connector. As the file is being downloaded, ProGet will send the same data back to the user and optionally write that data to disk. If we didn't do this, ProGet would be basically unusably slow.

    A ZIP archive (what package files use) uses a tail index, which means you have to read it backwards from the end of the file. So it's not possible to read an embedded file unless we've downloaded the entire package.
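
    For anyone curious what the "tail index" looks like in practice, here is a simplified sketch (not ProGet code; the helper name is made up). It finds the end-of-central-directory record that a ZIP file keeps at its very end and reads where the file index starts. It ignores ZIP64 and assumes the signature bytes do not also occur in a trailing archive comment.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # marker of the end-of-central-directory record


def locate_central_directory(data):
    """Return (entry_count, directory_offset) read from the end of a ZIP file."""
    pos = data.rfind(EOCD_SIGNATURE)  # search backwards from the end of the file
    if pos < 0:
        raise ValueError("end-of-central-directory record not found")
    # Fixed 22-byte part of the record: signature, two disk numbers, two entry
    # counts, central directory size, central directory offset, comment length.
    fields = struct.unpack("<4sHHHHIIH", data[pos:pos + 22])
    return fields[4], fields[6]


def make_archive():
    """Build a small in-memory ZIP with two entries, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("Example.nuspec", "<package />")
        archive.writestr("LICENSE.txt", "Example license text")
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_archive()
    print(locate_central_directory(data))
    try:
        # A partially streamed file has no index yet, so entries cannot be listed.
        locate_central_directory(data[: len(data) // 2])
    except ValueError as error:
        print("partial download:", error)
```

    The second call fails because the index only arrives with the last bytes of the file, which is why a package that is still being streamed cannot be inspected for its license file.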

    There are a few other "gotchas" we'd need to consider, even for cached/local packages. For example, we can't open/seek the package file just to know the license and if the package should be blocked - especially when it comes to cloud storage (for the same reason - tail indexing). So, we would obviously need to store package license file info in the database too... but then we'd need a way to deal with existing packages on disk that don't yet have that info.

    We may also want to add some sort of heuristic analysis of license text, even if it's simple as a basic distance check. Personally I think that's a bad idea to rely on... but other products do, and the reality is most users would just skim a license anyway.
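
    As an illustration of what such a "basic distance check" could look like, here is a small sketch using Python's standard `difflib`. The function name, the 0.9 threshold, and the whitespace/case normalization are arbitrary assumptions, and this is not how ProGet or any other product is known to do it.

```python
from difflib import SequenceMatcher


def closest_license(text, known_texts, threshold=0.9):
    """Return (code, score) of the most similar known license, or (None, score)."""
    def normalize(value):
        return " ".join(value.lower().split())  # ignore case and whitespace layout

    candidate = normalize(text)
    best_code, best_score = None, 0.0
    for code, reference in known_texts.items():
        matcher = SequenceMatcher(None, candidate, normalize(reference), autojunk=False)
        score = matcher.ratio()
        if score > best_score:
            best_code, best_score = code, score
    if best_score < threshold:
        return None, best_score  # too different: leave it for a human to review
    return best_code, best_score


if __name__ == "__main__":
    # Made-up reference snippets; real license texts would be much longer.
    known = {
        "License-A": "Permission is hereby granted to use this software",
        "License-B": "Redistribution in source and binary forms is permitted",
    }
    print(closest_license("permission is hereby granted\nto use this software", known))
    print(closest_license("All rights reserved. No use without a contract.", known))
```

    Anything below the threshold comes back without a license code, so a near miss would still end up in front of a person rather than being approved automatically.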

    This all becomes a lot easier after v2023 with centralized data and a package analyzer that can background scan all these, but still not trivial. And then there's the real hard part... the UI and documentation 😉

    We definitely don't want to hack something in like packageid:// and package:// -- those have been a total pain and plus, I hate the design 😅

    Anyway -- just wanted to give more technical insight into why ProGet behaves like this, and why I'm hesitant to jump on the "reading license file" approach without adding something that's a lot more valuable than what we have now.



  • Hi @apxltd, thanks for the insights!

    I half expected this to be way more complicated than I had hoped for, but one can dream...

