Revising the Eclipse IP Due Diligence Process for Third Party Content

Thursday, May 7, 2020 - 16:21 by Wayne Beaton

The Eclipse Foundation’s board of directors approved an update to the Eclipse Foundation’s Intellectual Property (IP) Policy in October 2019. With help from some of our open source project teams and the Eclipse Architecture Council, we’ve been defining, refining, and rolling out updates to our corresponding IP due diligence process and supporting services. A big part of this update is a change in the way that we will manage third party content.

In the context of the Eclipse IP Policy, “third party content” is content that is leveraged by the Eclipse open source project, but not otherwise produced or managed by an Eclipse open source project. We use “content” in a general sense, independent of any particular technology or means of delivery; libraries, components, individual files, etc. are all considered content (and, content that comes from a third party provider is “third party content). A library produced by, say, an Apache open source project, is considered to be third party content.

Previously, the IP Policy required that all third party content must be vetted by the Eclipse IP Team before it can be used by an Eclipse Project. The IP Policy updates turn this around. Eclipse project teams may now introduce new third party content during a development cycle without first checking with the IP Team. That is, a project team may commit build scripts, code references, etc. to third party content to their source code repository without first creating a contribution questionnaire (CQ) to request IP Team review and approval of the third party content. At least during the development period between releases, the onus is on the project team to–with reasonable confidence–ensure any third party content that they introduce is license compatible with the project’s license. Before any content may be included in any formal release the project team must validate that the third party content licenses are compatible with the project license.

The “license compatible” part is important. The IP Policy change eliminates the notion of different types of IP due diligence for third party content. Henceforth, all third party content is subject to license certification only.

With this change to license certification only for third party content, we are able to leverage existing sources of information license information. The short version is that we’re getting out of the business of scanning through every single bit of source code ourselves, and will instead leverage what we’ve already learned and trust sources of license information (and contribute to these other sources of information).

We currently have two trusted sources of license information: The Eclipse Foundation’s IPZilla and ClearlyDefined. The IPZilla database has been painstakingly built over most of the lifespan of the Eclipse Foundation; it contains a vast wealth of deeply vetted information about many versions of many third party libraries. ClearlyDefined is an OSI project that combines automated harvesting of software repositories and curation by trusted members of the community to produce a massive database of license (and other) information about content. The Eclipse Foundation’s IP Team has been working closely with the ClearlyDefined team, providing input into their processes and helping to curate their data.

Both of these sources of information can be leveraged relatively easily to find license information on a single unit of third party content. But many (perhaps most) projects depend on long lists of third party content and identifying license information for a long list of dependencies can be cumbersome. Fortunately, some APIs are available that make it possible to query these data sources in bulk and we have developed a prototype License Tool that committers may use to query those APIs.

Before we can talk about leveraging APIs or using tools, we need to be able to identify the units of third party content. This is pretty natural for developers who are familiar with technologies like Apache Maven which has well defined means of uniquely identifying content in a well-defined software repository. Maven coordinates–which combine groupid, artifactid, and version (often referred to as a “GAV”)–identify a very specific bit of content (especially when combined with a specific source, e.g. Maven Central). For example, org.junit.jupiter:junit-jupiter:5.5.2 unambiguously identifies content on Maven Central that is known to be under the Eclipse Public License version 2.0 (EPL-2.0). Other systems use different means of identifying content: node.js, for example, uses a pURL-like identifier for content in the NPM repository (e.g., @babel/generator@7.62).

To bridge the gap between the different means of identifying content (and fill in some of the blanks), we’ve adopted the ClearlyDefined project’s five part identifiers which includes the type of content, its software repository source, its namespace and name, and version. ClearlyDefined coordinates are roughly analogous to Maven coordinates, but–by including the type and source–provide a greater degree of precision. ClearlyDefined uses slashes to separate the parts of the identifier, so the Maven GAV org.junit.jupiter:junit-jupiter:5.5.2 maps to maven/mavencentral/org.junit.jupiter/junit-jupiter/5.5.2 and the NPM identifier @babel/generator@7.62 maps to npm/npmjs/@babel/generator/7.6.2.

The License Tool was created with a command line interface that takes a list of dependencies as input and generates output that identifies the content that needs further scrutiny (i.e., it lists all content for which license information cannot be found in one of the approved sources). As a side effect, it also generates a DEPENDENCIES file (example) that lists all dependencies, along with their resolved licenses and some other information (this file will be used to augment intellectual property logs; more on this later). We created it with a command line interface so that the initial version of the tool could be used on a variety of technologies; at some point, it is our hope that–with some community help–we’ll be able to create, for example, a Maven plugin as well.

What the License Tool does is relatively simple:

It first sends the content list to IPZilla; if an entry has been vetted by the IP Team and has been approved for use, then it goes into the “approved” pile. If an entry has been vetted by the IP Team and flagged as somehow problematic, it will be put into the “restricted” pile.
It then sends the content that has not been flagged as either approved or restricted to ClearlyDefined; if an entry is known to ClearlyDefined and has a score of at least 75 and all discovered licenses are on our approved licenses list, then it goes into the “approved” pile.
Everything else goes into the “restricted” pile.

The License Tool CLI lists all content that ended up in the “restricted” pile as requiring further scrutiny. It is this list of content that requires further scrutiny that the project team must engage on with the Eclipse IP Team.

For many build technologies, it is relatively easy to generate lists of dependencies directly (e.g., the Maven Dependencies plugin). We’ve been working with Maven-, Gradle-, NPM-, and Yarn- based systems: there are some specific examples in the prototype tool’s README (I’ll walk through some specific examples in future posts). Project teams that use technologies that do not provide easily parsed/converted dependency lists can manually generate a list and feed that to the tool. Sorting out how we make this work for more technologies is on the to-do list.

It’s important to keep in mind that committers are ultimately responsible to ensure that all intellectual property is accounted for properly (e.g., if you have third party content buried in product content, you’ll likely need to account for that manually). If your build technology does not completely or accurately report the list of dependencies, then the committers need to account for that. Furthermore, it’s important to keep in mind that the License Tool is intended to assist committers in determining whether or not further scrutiny is required for their project dependencies; while our early work with it shows that it produces good results, those results should not be blindly accepted as authoritative (the tool is only as good as the input which which it is provided). If something doesn’t seem right, please open an issue so that we can address it.

Edit (May 29/2020): we’ve revised the acceptance criteria for content listed on ClearlyDefined to “a score of at least 75 and all discovered licenses are on our approved licenses list”

Revising the Eclipse IP Due Diligence Process for Third Party Content

Don't forget to share this post!