Webpack and yarn magic against duplicates in bundles
This page describes the theory and some technical details behind webpack-deduplication-plugin, which helped us reduce JavaScript size in Jira by ~10%.
Setting up the scene
To avoid turning this article into a book on modern dependencies and bundling tools, it is assumed that the reader has some understanding of what tools like npm, yarn and webpack are used for in modern frontend projects and wants to dig a bit deeper into the magic behind the scenes.
Some terminology used in the article:
Direct dependencies: packages on which your project relies explicitly. Typically installed via yarn add package-name. The full list of those can be found in the dependencies field in package.json at the root of the project.
Transitive dependencies: packages on which your project relies implicitly. Those are the dependencies on which your direct dependencies rely. Typically you won’t see them in package.json, but they can be seen, for example, in the yarn.lock file.
Duplicated dependencies: transitive dependencies with mismatched versions. If one of the project dependencies has a button package version 4.0.0 as a transitive dependency and another has the same button version 3.0.0, both of those versions will be installed and the button dependency will be duplicated.
De-duplication: the process of eliminating duplicated dependencies according to their semver versions (x.x.x, i.e. major.minor.patch). Typically, a range of versions within the same major version contains no breaking changes, so only the latest version within this range needs to be installed. For example, button versions 4.0.0 and 4.5.6 can be “de-duplicated” and only version 4.5.6 will be installed.
yarn.lock file: an auto-generated file that contains the exact and full list of all direct and transitive dependencies and their exact versions in yarn-based projects.
The problem of duplicated dependencies
“Duplicated” dependencies are inevitable in any medium- to large-scale project that relies on npm packages. When a project has dozens of “direct” dependencies, and every one of those has its own dependencies, the final number of all packages (direct and transitive) installed in a project can run into the hundreds. In this situation, it is more likely than not that some of the dependencies will be duplicated.
Considering that those are bundled together and served to the customers, it is important to reduce the number of duplicates to a minimum in order to reduce the final JavaScript size. This is where the deduplication process comes into play.
Deduplication in yarn
Consider, for example, a project that has modal-dialog@3.0.0 and button@2.5.0 among its direct dependencies, while modal-dialog brings button@2.4.1 as a transitive dependency. If left un-deduplicated, both buttons will exist in the project, and in yarn.lock we will see something like this:
modal-dialog@^3.0.0:
  version "3.0.0"
  resolved "exact-link-to-where-download-modal-dialog-3.0.0-from"
  dependencies:
    button@^2.4.1
button@^2.5.0:
  version "2.5.0"
  resolved "exact-link-to-where-download-2.5.0-version-from"
button@^2.4.1:
  version "2.4.1"
  resolved "exact-link-to-where-download-2.4.1-version-from"
Now, we know that according to semver button@2.4.1 and button@2.5.0 are compatible, and therefore we can tell yarn to grab the same button@2.5.0 version for both of them, that is, to “deduplicate” them. From the project perspective only one button is left, and in the yarn.lock file we’ll see this:
modal-dialog@^3.0.0:
  version "3.0.0"
  resolved "exact-link-to-where-download-modal-dialog-3.0.0-from"
  dependencies:
    button@^2.4.1
button@^2.4.1, button@^2.5.0:
  version "2.5.0"
  resolved "exact-link-to-where-download-2.5.0-version-from"
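To make the semver reasoning a bit more tangible, here is a minimal sketch of how several caret ranges could be collapsed into one installed version. It is not yarn's actual implementation: the function names are made up for illustration, and only plain major.minor.patch versions and ^ ranges are handled.

```javascript
// Parse "major.minor.patch" into numbers (no pre-release tags in this sketch).
function parseVersion(version) {
  const [major, minor, patch] = version.split(".").map(Number);
  return { major, minor, patch };
}

// Negative, zero or positive, like a sort comparator.
function compareVersions(a, b) {
  return a.major - b.major || a.minor - b.minor || a.patch - b.patch;
}

// True when `version` satisfies a caret range such as "^2.4.1".
function satisfiesCaret(version, range) {
  const wanted = parseVersion(range.replace(/^\^/, ""));
  const actual = parseVersion(version);
  if (compareVersions(actual, wanted) < 0) return false;
  // The caret allows changes that do not touch the left-most non-zero part.
  if (wanted.major > 0) return actual.major === wanted.major;
  if (wanted.minor > 0) return actual.major === 0 && actual.minor === wanted.minor;
  return actual.major === 0 && actual.minor === 0 && actual.patch === wanted.patch;
}

// Pick the highest available version that satisfies every range, or null.
function dedupe(ranges, availableVersions) {
  const candidates = availableVersions
    .filter((version) => ranges.every((range) => satisfiesCaret(version, range)))
    .sort((a, b) => compareVersions(parseVersion(b), parseVersion(a)));
  return candidates.length > 0 ? candidates[0] : null;
}

console.log(dedupe(["^2.4.1", "^2.5.0"], ["2.4.1", "2.5.0"])); // 2.5.0
console.log(dedupe(["^1.0.0", "^2.5.0"], ["1.3.0", "2.5.0"])); // null
```

The second call returns null because no single version can satisfy ranges from two different major versions, which is exactly the situation where deduplication on the yarn level stops helping.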
Deduplication in yarn — not compatible version
The above deduplication technique is the only thing that we usually have in the fight against duplicates, and usually it works quite well. But what will happen if a project has transitive dependencies that cannot be deduped via semver? If, for example, our project has modal-dialog@3.0.0, button@2.5.0 and editor@5000.0.0 as direct dependencies, and modal-dialog and editor bring button@^1.0.0 and button@^1.3.0 respectively as transitive dependencies?
Using the same technique, we can de-duplicate the buttons from the 1.x.x range, so from the project perspective two buttons remain (1.3.0 and 2.5.0), and in the yarn.lock file we will see this:
modal-dialog@^3.0.0:
  version "3.0.0"
  resolved "exact-link-to-where-download-modal-dialog-3.0.0-from"
  dependencies:
    button@^1.0.0
editor@^5000.0.0:
  version "5000.0.0"
  resolved "exact-link-to-where-download-editor-5000.0.0-from"
  dependencies:
    button@^1.3.0
button@^2.5.0:
  version "2.5.0"
  resolved "exact-link-to-where-download-2.5.0-version-from"
button@^1.0.0, button@^1.3.0:
  version "1.3.0"
  resolved "exact-link-to-where-download-1.3.0-version-from"
Two versions of buttons are unavoidable, and in this case there is usually nothing we can do other than upgrade modal-dialog and editor to versions that both depend on button from the 2.x.x range, so that it can be de-duplicated properly. Typically, in this case, we stop, say that our project has “2 versions of buttons” and move on with our lives.
But what if we dig a little bit further and check out how exactly those 2 buttons are installed on disk and bundled together?
Duplicated dependencies install
When we install our dependencies via classic yarn or npm (pnpm or yarn 2.0 change the situation and are not considered here), the package manager hoists everything it can up to the root node_modules. If, for example, in our project above both editor and modal-dialog depend on the same “deduped” version of tooltip, but our project itself does not, tooltip will be installed at the root of the project, and inside the node_modules folder we’ll see this structure:
/node_modules
  /editor
  /modal-dialog
  /tooltip
And because of that, we can be sure that we only have one version of tooltip in the project, even if two completely different dependencies depend on slightly different versions of it.
Unless…
Unless those versions are not semver compatible and cannot be deduped that easily 😬 In the project with buttons from above, the root node_modules/button slot is already taken by button@2.5.0, so button@1.3.0 cannot be hoisted there.
Even if dependencies are “deduped” on the yarn.lock level and we “officially” have only 2 versions of buttons in yarn.lock, every single package with button@1.3.0 as a dependency will install its own copy of it.
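A toy model of that install behaviour might look like the sketch below. It is a deliberate simplification (one level of transitive dependencies, first come first served), not the real hoisting algorithm of yarn or npm, and the tooltip version is invented for the example.

```javascript
// direct: { name: version } of the project's own dependencies.
// transitive: { parentName: { name: version } } of what each direct dependency needs.
function installLayout(direct, transitive) {
  const root = new Map(); // package name -> version living at /node_modules/<name>
  const layout = [];
  for (const [name, version] of Object.entries(direct)) {
    root.set(name, version);
    layout.push({ path: `/node_modules/${name}`, version });
  }
  for (const [parent, deps] of Object.entries(transitive)) {
    for (const [name, version] of Object.entries(deps)) {
      if (!root.has(name)) {
        // The root slot is free: hoist.
        root.set(name, version);
        layout.push({ path: `/node_modules/${name}`, version });
      } else if (root.get(name) !== version) {
        // The root slot is taken by another version: install a private copy.
        layout.push({ path: `/node_modules/${parent}/node_modules/${name}`, version });
      }
      // Same version already at the root: nothing to install.
    }
  }
  return layout;
}

const layout = installLayout(
  { "modal-dialog": "3.0.0", editor: "5000.0.0", button: "2.5.0" },
  {
    "modal-dialog": { button: "1.3.0", tooltip: "1.0.0" }, // tooltip version is hypothetical
    editor: { button: "1.3.0", tooltip: "1.0.0" },
  }
);
console.log(layout.map((entry) => `${entry.path} (${entry.version})`).join("\n"));
```

In this model tooltip ends up at the root once, while button@1.3.0 is installed twice: once inside modal-dialog and once inside editor.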
Duplicated dependencies and webpack
So, if a project is bundled with webpack, how exactly does it handle the situation from above? It doesn’t, actually (there was a webpack dedup plugin a long time ago, but it was removed after webpack 2.0).
Behind the scenes, webpack just builds a graph of all your files and their dependencies, based on what’s installed and required in your node_modules, via the normal node resolution algorithm. TL;DR: every time a file in editor does import Button from 'button';, node will try to find this button in the closest node_modules, starting from the folder of the file in which the request appeared and walking up. The same story with modal-dialog. And then, from webpack’s perspective, the final requests for the button will be:
- project/node_modules/editor/node_modules/button/index.js — when it’s requested from within editor
- project/node_modules/modal-dialog/node_modules/button/index.js — when it’s requested from within modal-dialog
Webpack is not going to check whether they are exactly the same; it will treat them as unique files and put both of them in the same bundle. Our “duplicated” button just got double duplicated.
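The lookup described above can be imitated with a few lines of code. This is only a sketch of the “walk up and check node_modules” idea against a hypothetical in-memory list of installed folders; the real node and webpack resolvers handle much more (package.json fields, file extensions, symlinks and so on).

```javascript
const path = require("node:path").posix;

// Hypothetical snapshot of the package folders that exist on disk.
const installed = new Set([
  "/project/node_modules/button",
  "/project/node_modules/editor",
  "/project/node_modules/editor/node_modules/button",
  "/project/node_modules/modal-dialog",
  "/project/node_modules/modal-dialog/node_modules/button",
]);

// Walk up from the importing file's folder and return the first
// <folder>/node_modules/<name> that exists, or null if there is none.
function resolvePackage(name, fromFile) {
  let dir = path.dirname(fromFile);
  while (true) {
    const candidate = path.join(dir, "node_modules", name);
    if (installed.has(candidate)) return candidate;
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}

console.log(resolvePackage("button", "/project/node_modules/editor/src/index.js"));
console.log(resolvePackage("button", "/project/node_modules/modal-dialog/src/index.js"));
console.log(resolvePackage("button", "/project/src/app.js"));
```

The same import of button resolves to three different folders depending on who is asking, and a bundler that follows these rules sees three different modules.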
Deduplication in webpack — first attempt
Since those buttons are exactly the same, the very first question that comes to mind is: is it possible to take advantage of that and “trick” webpack into recognising it? And indeed, it is possible and it is ridiculously simple.
Webpack is incredibly flexible: it provides a rich plugin interface with access to almost everything you can imagine (and to some things that you cannot). At its core, most of its features are built with plugins as well, and it exports a lot of them for others to use.
One of those plugins is NormalModuleReplacementPlugin — it gives the ability to replace one file with another file during build time based on a regular expression. Which is exactly what we need! The rest is just a matter of coding.
First, detect all “duplicated” dependencies: grab a list of all packages within node_modules, keep those that have node_modules in their install path more than once (basically all the “nested” packages from the yarn install chapter above), and group them by name and version from package.json.
Second, replace all occurrences of the “same” package with the very first one from the list.
And 💥, there is no “third”, the solution works, it’s safe, and reduces bundle sizes in Jira by ~10%.
The full implementation was literally just 100 lines. Be careful about celebrating and copying the approach though, this is not the end of the article 😉
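For the curious, the two steps of this first attempt could be sketched roughly like this. It is not the plugin's source code: the scan result is hard-coded, all function names are invented, and wiring the rewrite function into webpack is left out.

```javascript
// Hypothetical scan result: package folders found under node_modules,
// with name and version as read from each package.json.
const installedPackages = [
  { dir: "/project/node_modules/button", name: "button", version: "2.5.0" },
  { dir: "/project/node_modules/editor", name: "editor", version: "5000.0.0" },
  { dir: "/project/node_modules/modal-dialog", name: "modal-dialog", version: "3.0.0" },
  { dir: "/project/node_modules/editor/node_modules/button", name: "button", version: "1.3.0" },
  { dir: "/project/node_modules/modal-dialog/node_modules/button", name: "button", version: "1.3.0" },
];

// Step 1: keep nested installs (more than one "node_modules" in the path)
// and group them by name@version.
function findDuplicates(packages) {
  const groups = new Map();
  for (const pkg of packages) {
    const depth = pkg.dir.split("/").filter((part) => part === "node_modules").length;
    if (depth < 2) continue;
    const key = `${pkg.name}@${pkg.version}`;
    groups.set(key, [...(groups.get(key) || []), pkg.dir]);
  }
  // A group with a single folder is nested, but not duplicated.
  return [...groups.values()].filter((dirs) => dirs.length > 1);
}

// Step 2: point every copy except the first one at the first one.
function buildReplacements(duplicateGroups) {
  const replacements = new Map();
  for (const [first, ...rest] of duplicateGroups) {
    for (const dir of rest) replacements.set(dir, first);
  }
  return replacements;
}

// Rewrite an absolute file path that points into a duplicated copy.
function rewriteRequest(request, replacements) {
  for (const [from, to] of replacements) {
    if (request === from || request.startsWith(from + "/")) {
      return to + request.slice(from.length);
    }
  }
  return request;
}

const replacements = buildReplacements(findDuplicates(installedPackages));
console.log(rewriteRequest("/project/node_modules/modal-dialog/node_modules/button/index.js", replacements));
```

With this input the button inside modal-dialog is redirected to the copy inside editor, and the root button@2.5.0 is left alone. Keep in mind that this mirrors the first attempt, including the weaknesses discussed in the rest of the article.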
Deduplication in webpack — actual solution
While the solution above worked well and was safe (we did rigorous testing of it before releasing it to production), it had an unfortunate side-effect: webpack started to generate assets in a non-deterministic way 🤦‍♀️ On every single re-build it was either moving some pieces of those “duplicated” modules around or just generating new internal ids. Any possible cause within our own code was ruled out quite fast. Something weird was happening within the webpack internals themselves.
Debugging what the hell was going on, understanding why, and releasing a solution that fixes it for good took a week and a deep dive into the internals of NormalModuleReplacementPlugin and webpack itself, and would take another huge article to describe properly. I will try to list here the most interesting things about yarn and webpack (in no particular order) that were discovered along the way and could be useful for others to know.
Findings and other curiosities
The non-deterministic behaviour is reproducible non-deterministically
If you try to reproduce the non-deterministic part on a small synthetic example, most likely you won’t be able to. I was only able to do so in my toy repo when I imported the entire @atlaskit/editor into it; a combination of just a few Atlaskit components didn’t do it. Reducing the chunk size to a minimum to make webpack split code and adding manual async imports didn’t help either.
Interestingly, the order in which the hook that you need to listen to in order to override requests for files is called is always different, but the final assets on small examples (and without deduping, for that matter) are deterministic.
NormalModuleReplacementPlugin was not built for the purpose.
First of all, it executes the RegExp only on the request property of the result and replaces only request as well (check out the source). However, the result object has many more properties that contain information about the origin of the module (and in theory need to be replaced as well), one of which is context: where the request actually originated from. And if a file is requested via a relative path, the request property will contain that relative path as well.
{
  request: "./styled",
  context: "/project/node_modules/editor/node_modules/button"
}
The final “path” to the file is a resolution of both, and in order to properly detect “duplicates”, we need to watch them both and replace all the fields that have the “duplicated” information (including context), which NormalModuleReplacementPlugin does not do.
Nevertheless, NormalModuleReplacementPlugin can actually be used if there is a need
Because what is said in the “finding” above is not entirely correct: what turned out to be enough in the end is to replace just request with the absolute path resolved from both request and context.
From this:
{
  request: "./styled",
  context: "/project/node_modules/editor/node_modules/button"
}
it transforms into this:
{
  request: "/project/node_modules/modal-dialog/node_modules/button/styled",
  context: "/project/node_modules/editor/node_modules/button"
}
and webpack is okay with it and able to correctly bundle it 😲
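The transformation above can be sketched as a small function. The shape of the data object ({ request, context }) is taken from the example; the function name and the replacements map are assumptions made for illustration, and only relative requests are handled here.

```javascript
const path = require("node:path").posix;

// Hypothetical map: folder of a duplicated copy -> folder of the copy we keep.
const replacements = new Map([
  [
    "/project/node_modules/editor/node_modules/button",
    "/project/node_modules/modal-dialog/node_modules/button",
  ],
]);

// Resolve a relative request against its context and, if the result lives inside
// a duplicated copy, replace request with the absolute path inside the kept copy.
function dedupeRequest(data, replacements) {
  // Bare package names ("button") are out of scope for this sketch.
  if (!data.request.startsWith(".")) return data;
  const absolute = path.resolve(data.context, data.request);
  for (const [from, to] of replacements) {
    if (absolute.startsWith(from + "/")) {
      // Only request changes, context stays as it was.
      return { ...data, request: to + absolute.slice(from.length) };
    }
  }
  return data;
}

const result = dedupeRequest(
  { request: "./styled", context: "/project/node_modules/editor/node_modules/button" },
  replacements
);
console.log(result);
```

Resolving first and replacing second means the check is always done on an absolute path, regardless of how the file was originally requested.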
“naive” replace (string.startsWith) is not going to work
Even if in theory the packages are “the same”, in reality they are not, and the difference is called “transitive dependencies of transitive dependencies”. The “simplest” example of the use case would be:
- button@2.x and icon@3.x at the root
- editor has button@1.x as a transitive dependency, which has icon@1.x as a transitive dependency
- modal-dialog has button@1.x as a transitive dependency, which has icon@1.x as a transitive dependency, PLUS modal-dialog has icon@2.x as its own transitive dependency (those who survived and read until this moment: kudos to you, you’re heroes)
this will be represented as the following folder structure:
/node_modules
  /editor
    /node_modules
      /button-1.3.0
      /icon-1.0.0 // on the same level as button above
  /modal-dialog
    /node_modules
      /button-1.3.0
        /node_modules
          /icon-1.0.0 // nested within button since on the lvl above there is another icon
      /icon-2.0.0
  /button-2.5.0
and the final requests to icon@1.x will be:
/project/node_modules/editor/node_modules/icon-1.0.0
/project/node_modules/modal-dialog/node_modules/button-1.3.0/node_modules/icon-1.0.0
Considering that button@1.x is a duplicate, we need to replace the button in modal-dialog with the button from editor. And just a “naive” startsWith will replace
/project/node_modules/modal-dialog/node_modules/button-1.3.0
with
/project/node_modules/editor/node_modules/button-1.3.0
and the path to the last icon will be transformed into
/project/node_modules/editor/node_modules/button-1.3.0/node_modules/icon-1.0.0
but there is no icon at this path.
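The failure is easy to reproduce on the folder structure above. The sketch below contrasts the naive replacement with one possible guard; the guard is my own assumption for illustration and not necessarily what the plugin does in the end.

```javascript
// Folders that exist on disk in the example above.
const existing = new Set([
  "/project/node_modules/editor/node_modules/button-1.3.0",
  "/project/node_modules/editor/node_modules/icon-1.0.0",
  "/project/node_modules/modal-dialog/node_modules/button-1.3.0",
  "/project/node_modules/modal-dialog/node_modules/button-1.3.0/node_modules/icon-1.0.0",
  "/project/node_modules/modal-dialog/node_modules/icon-2.0.0",
  "/project/node_modules/button-2.5.0",
]);

const from = "/project/node_modules/modal-dialog/node_modules/button-1.3.0";
const to = "/project/node_modules/editor/node_modules/button-1.3.0";

// First attempt: rewrite anything that starts with the duplicate's folder.
function naiveReplace(request) {
  return request.startsWith(from) ? to + request.slice(from.length) : request;
}

// Hypothetical guard: leave a request alone when it points into a
// node_modules folder nested inside the duplicated package.
function guardedReplace(request) {
  if (request !== from && !request.startsWith(from + "/")) return request;
  const rest = request.slice(from.length);
  if (rest.startsWith("/node_modules/")) return request;
  return to + rest;
}

const iconRequest = from + "/node_modules/icon-1.0.0";
console.log(existing.has(naiveReplace(iconRequest))); // false: the rewritten path does not exist
console.log(existing.has(guardedReplace(iconRequest))); // true: the request is left untouched
console.log(guardedReplace(from + "/index.js")); // files of button itself are still redirected
```

With the guard, the nested icon stays where it is (and remains a separate copy in the bundle), which is less optimal than full deduplication but does not produce requests to paths that do not exist.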
The end
So, what was the final reason for the non-deterministic behaviour? It is actually a combination of:
- non-deterministic order of hooks on the webpack side
- “naive” initial strings replacement
- listening and replacing only “request” in the initial implementation
- a few other edge cases that are not mentioned here and that caused not all modules to be resolved
Now that it’s solved and the plugin has been battle-tested in Jira, everyone else can use it too and shrink their bundles a bit. In Jira it gave us an overall bundle size reduction of ~10% and a ~300ms TTI improvement on the Issue View page. The plugin is available here: https://github.com/atlassian-labs/webpack-deduplication-plugin