Remove data from git permanently.
If you have this problem, here’s something you can try.
Just committed something that was not supposed to be committed? Is there a way to delete it from the commit history forever?
If you are looking for a quick fix, you can go straight to the bottom.
Back in my junior days, I was really bad at git, and I did quite a lot of silly things with it.
Not saying I’m a git guru now. I still don’t think I’m capable of using git to its full potential. But I’m trying; after all, this is the thing I have to deal with daily.
Today, I want to try a couple of different ways to save myself if I’ve just committed a crime…
I mean, committed something that should not be committed.
Back to my junior days… I committed node_modules, which turned my repo into a total behemoth. I was wondering why it took so long to push to the remote when I realized I was uploading a multi-gigabyte ball of junk (probably not gigabytes, but the size of node_modules really was out of hand).
I also committed some sensitive data, like credentials or keys, that is not supposed to be committed to git.
It’s totally fine if it’s just a fresh project. I can just delete it on the remote and start over. Or, if it’s only on my local machine, I can just reset it, exclude the sensitive data, and create a new commit.
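For the easy local case, here is a minimal sketch of that reset-and-recommit flow. It runs in a throwaway directory, and the file contents, the user identity, and the commit messages are made up for illustration. Note that this only helps while the bad commit has never been pushed.

```shell
#!/bin/sh
set -eu

# Throwaway repository; identity and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo 'console.log("v1")' > index.js
git add index.js
git commit -q -m "init"

# The mistake: a credential file sneaks into the latest commit.
echo 'console.log("v2")' > index.js
echo 'const KEY = "hypothetical-secret"' > secret.js
git add index.js secret.js
git commit -q -m "oops"

# Undo the last commit but keep its changes staged.
git reset -q --soft HEAD~1
# Unstage the secret (the file stays on disk) and ignore it from now on.
git rm -q --cached secret.js
echo "secret.js" > .gitignore
git add .gitignore
git commit -q -m "update index.js"

# List the files tracked by the new commit.
tracked=$(git ls-tree -r --name-only HEAD)
echo "$tracked"
```

The old commit is still reachable through the reflog on this machine for a while, which is fine as long as it never left the machine.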
But what if the project has been there for a long time? What if the commit has been there for months before someone, or some static code analysis, finds that there’s a credential inside the codebase?
If you’re working at a company that can afford a self-hosted private git server, it’s probably not too big of a deal, because if a hacker is able to breach the server, the credential you committed is probably the least of your concerns.
But if you committed the data to GitHub or some other public host, that, my friend, is something you need to deal with ASAP. GitHub may have some of the most talented programmers in the world, but it’s just not the safest place in the world.
But sometimes there’s no way to delete the whole thing and start over, and it’s practically impossible to reset to the point before the credential leaked.
So, how to fix it?
Git is basically a database that stores all the commits. Every commit has its own hash, and so does every file. That is basically how we can go back in time with git: we can use the hash to find every file that has ever been committed.
Here I created a repo that has three commits. The first commit (the one with init as its commit message) includes two files, index.js and secret.js (shown under the second command).
And if we git cat-file -p the hash for secret.js, we can see what was inside the file at the very moment it was committed.
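To try this yourself, here is a small sketch that rebuilds a similar repo in a throwaway directory and reads the file back by its hash. The file contents and the user identity are placeholders, so the hashes will differ from the ones in this post.

```shell
#!/bin/sh
set -eu

# Throwaway repository; identity and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo 'console.log("hello")' > index.js
echo 'const API_KEY = "hypothetical-secret"' > secret.js
git add index.js secret.js
git commit -q -m "init"

# A commit points at a tree object...
git cat-file -p HEAD
# ...and the tree lists every file together with the hash of its blob.
git ls-tree HEAD

# Resolve the blob hash of secret.js in that commit, then print the blob.
blob=$(git rev-parse HEAD:secret.js)
content=$(git cat-file -p "$blob")
echo "$content"
```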
OK, each commit and file has its own hash, and the hash serves as an index into the database. So, where is the database? It’s inside the .git/objects folder in the repo.
Take the hash of secret.js that git cat-file -p showed us, which starts with 40aab7f.... Git uses the first two characters of the hash as a folder name and the rest as the file name, so this hash points to the file .git/objects/40/aab7f...
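The mapping from hash to file can be sketched like this, assuming the object is still stored loose (not yet packed by garbage collection), which is the case right after a commit in a small repo. Everything in the setup is a placeholder.

```shell
#!/bin/sh
set -eu

# Throwaway repository; identity and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo 'const API_KEY = "hypothetical-secret"' > secret.js
git add secret.js
git commit -q -m "init"

# Blob hash of secret.js in the latest commit.
blob=$(git rev-parse HEAD:secret.js)

# First two characters become the folder, the rest becomes the file name.
dir=$(printf '%s' "$blob" | cut -c1-2)
file=$(printf '%s' "$blob" | cut -c3-)
object_path=".git/objects/$dir/$file"

ls -l "$object_path"
```

The file itself is zlib-compressed, which is why we read it through git cat-file -p instead of opening it directly.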
So, can I just delete the git object? Yes and no…
If I just delete the git object, git still works for the most part. But in this case, the initial commit includes two files, index.js and secret.js. If I run git diff 282867, it throws an error: fatal: unable to read 40aab7f.... And if I run git fsck to check the data integrity, the following error pops up.
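Here is a sketch that reproduces the breakage in a throwaway repo: it deletes the loose object by hand and then asks git fsck for a report. The setup values are placeholders, and the exact wording of the report may vary between git versions, so the script only captures it rather than assuming its text.

```shell
#!/bin/sh
set -eu

# Throwaway repository; identity and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo 'console.log("hello")' > index.js
echo 'const API_KEY = "hypothetical-secret"' > secret.js
git add index.js secret.js
git commit -q -m "init"

# Locate the loose object file for secret.js and delete it by hand.
blob=$(git rev-parse HEAD:secret.js)
object_path=".git/objects/$(printf '%s' "$blob" | cut -c1-2)/$(printf '%s' "$blob" | cut -c3-)"
rm -f "$object_path"

# The commit's tree still references the blob, so the integrity check complains.
status=0
report=$(git fsck 2>&1) || status=$?
echo "$report"
echo "git fsck exit status: $status"
```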
Clearly, this is not a good way to do it.
So ideally, we need to go through every commit, find the file, remove it from git, and break all the links that point to the file in the database.
Yes, if I were bright enough I could probably write a tool to do that. But I’m not…
So, what are the options?
Built-In Solution (I)
There’s a built-in tool called filter-branch. We can use it to remove the unwanted file from our git commit history.
git filter-branch --tree-filter "rm -f secret.js"
After you execute the command, it prints a warning message and waits for a couple of seconds, just to give you some time before you regret it.
Once it finishes, we will find that all the hashes have been changed.
And if we inspect the first commit with git cat-file -p again, secret.js is gone!
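A quick way to check the result without reading hashes by eye is to ask git log whether any commit on the current branch still touches the file. The sketch below does the rewrite in a throwaway repo with placeholder contents. Setting FILTER_BRANCH_SQUELCH_WARNING=1 skips the warning and the pause, which is only reasonable here because the repo is disposable.

```shell
#!/bin/sh
set -eu

# Throwaway repository; identity and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo 'console.log("hello")' > index.js
echo 'const API_KEY = "hypothetical-secret"' > secret.js
git add index.js secret.js
git commit -q -m "init"
echo 'console.log("again")' >> index.js
git commit -q -am "second"

old_head=$(git rev-parse HEAD)

# Rewrite every commit on the current branch, deleting secret.js from each tree.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --tree-filter "rm -f secret.js" >/dev/null 2>&1

new_head=$(git rev-parse HEAD)

# Commits on the rewritten branch that still touch secret.js (expected: none).
remaining=$(git log --oneline -- secret.js)

echo "old HEAD: $old_head"
echo "new HEAD: $new_head"
echo "commits still touching secret.js: ${remaining:-none}"
```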
But as all programmers know, one does not simply delete anything from git forever. The reflog is still there, and anyone can restore the file using a hash from the reflog.
We also still have the original reference that filter-branch keeps as a backup, and you can go back from there too. This is not enough. We need to delete the original reference.
rm .git/refs/original/refs/heads/main
And the reflog entry:
git reflog delete HEAD@{0}
Do note that the reference is not always HEAD@{0}; it depends on how many git commands you’ve run after filter-branch. Run git reflog again to make sure you are targeting the right entry.
We can use git fsck to check whether there are any unreachable objects. In my case, there are none, so we can proceed to garbage collection.
git gc --prune=now
After the GC, git packs all the objects, and any unreachable objects are collected and removed.
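Putting the cleanup together, here is a sketch of the whole sequence in a throwaway repo, ending with a check that the old blob can no longer be read. Two things differ from the steps above and are my substitution, not what I ran in the post: the backup refs are removed with git update-ref -d instead of rm, and every reflog entry is expired with git reflog expire instead of deleting one entry at a time. All contents and the identity are placeholders.

```shell
#!/bin/sh
set -eu

# Throwaway repository; identity and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo 'console.log("hello")' > index.js
echo 'const API_KEY = "hypothetical-secret"' > secret.js
git add index.js secret.js
git commit -q -m "init"

# Remember the blob we want gone, then rewrite history.
blob=$(git rev-parse HEAD:secret.js)
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --tree-filter "rm -f secret.js" >/dev/null 2>&1

# 1. Drop the backup refs that filter-branch left under refs/original/.
git for-each-ref --format='%(refname)' refs/original/ |
  xargs -n 1 git update-ref -d

# 2. Expire every reflog entry so nothing points at the old commits.
git reflog expire --expire=now --all

# 3. Garbage-collect and prune unreachable objects immediately.
git gc -q --prune=now

# Check whether the old blob still exists in the object database.
if git cat-file -e "$blob" 2>/dev/null; then
  still_there=yes
else
  still_there=no
fi
echo "old blob still present: $still_there"
```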
Recommended Way by Git (II)
First, you will need to install filter-repo. Once it’s installed, use the following command.
git filter-repo --invert-paths --path secret.js
This should be slightly faster, since it doesn’t pause to make us think twice before it does its work.
I had to use the --force flag because my repo was not a fresh clone. It’s better practice to make a fresh clone before the operation, because you will lose all your stashed state after this.
If you inspect the hashes with git cat-file -p, you should see the same result: every hash has changed, and secret.js is gone from the first commit. It also clears the reflogs and does the garbage collection for you. It does feel more convenient.
Besides the approaches above, there is other software that can help clean up git history, like BFG. You can also find scripts people have posted that do similar things.
Lastly, don’t forget to force push the changes to the remote. After all, the whole point of this is to remove the data from the internet.
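To see the force push in action without touching a real server, the sketch below uses a local bare repository as a stand-in for the remote. The paths, the remote name origin, and all contents are placeholders. A normal push would be rejected after the rewrite because the histories have diverged, which is why --force is needed.

```shell
#!/bin/sh
set -eu

# A local bare repo plays the role of the remote; everything is a placeholder.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git config user.name "Example"
git config user.email "example@example.com"
echo 'console.log("hello")' > index.js
echo 'const API_KEY = "hypothetical-secret"' > secret.js
git add index.js secret.js
git commit -q -m "init"

git remote add origin "$work/remote.git"
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"          # the leak is now on the "remote"

# Rewrite local history so secret.js never existed.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --tree-filter "rm -f secret.js" >/dev/null 2>&1

# Overwrite the remote branch with the rewritten history.
git push -q --force origin "$branch"

# Commits on the remote branch that still touch secret.js (expected: none).
remote_hits=$(git --git-dir="$work/remote.git" log --oneline "$branch" -- secret.js)
echo "remote commits still touching secret.js: ${remote_hits:-none}"
```

Keep in mind that this only moves the branch. The remote may keep the old objects around until it runs its own garbage collection, and everyone else who has already cloned the repo still has the old history.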
And as the post on GitHub states, once you have uploaded any sensitive data online, consider it already compromised. Change your password or invalidate the credential, or whatever you just pushed, before removing it from git to keep your data safe.