Finding and Purging Big Files From Git History

2012/01/17

On a recent grails project, we’re using a git repo that was originally converted from a SVN repo with a ton of large binary objects in it (lots of jar files that really should come from an ivy/maven repo). The .git directory was over a gigabyte in size and this made it very cumbersome to clone and manipulate.

We decided to leverage git’s history rewriting capabilities to make a much smaller repository (and kept our previous repo as a backup just in case).

Here are a few questions/answers that I figured out how to answer with git and some shell commands:

What object SHA is associated with each file in the Repo?

Git has a unique SHA that it associates with each object (such as files which it calls blobs) throughout it’s history.

This helps us find that object and decide whether it’s worth deleting later on:

git rev-list --objects --all | sort -k 2 > allfileshas.txt

Take a look at the resulting allfileshas.txt file for the full list.

What Unique Files Exist Throughout The History of My Git Repo?

If you want to see the unique files throughout the history of your git repo (such as to grep for .jar files that you might have committed a while ago):

    git rev-list --objects --all | sort -k 2 | cut -f 2 -d\  | uniq

How Big Are The Files In My Repo?

We can find the big files in our repo by doing a git gc which makes git compact the archive and stores an index file that we can analyse.

Get the last object SHA for all committed files and sort them in biggest to smallest order:

git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

Take that result and iterate through each line of it to find the SHA, file size in bytes, and real file name (you also need the allfileshas.txt output file from above):

for SHA in `cut -f 1 -d\  < bigobjects.txt`; do
echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done;

(there’s probably a more efficient way to do this, but this was fast enough for my purposes with ~50k files in our repo)

Then, just take a look at the bigtosmall.txt file to see your biggest file culprits.

Purging a file or directory from history

Use filter-branch to remove the file/directory (replace MY-BIG-DIRECTORY-OR-FILE with the path that you’d like to delete relative to the root of the git repo:

git filter-branch --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch MY-BIG-DIRECTORY-OR-FILE' --tag-name-filter cat -- --all

Then clone the repo and make sure to not leave any hard links with:

git clone --no-hardlinks file:///Users/yourUser/your/full/repo/path repo-clone-name

You can use this command from the parent directory that contains your git repository and it’s clone to see how much space each of them take, and how much you’ve shrunk the repo in size:

du -s *(/)     # add the -h flag to see the output in human readable size formats, just like ls -lah vs ls -la

With these commands, I was able to reduce the file size of our repo with a few thousand commits down below the size of the checked out repository (more than an order of magnitude smaller). I only removed old binary files, we still have full history for all code files.

There are 3 comments in this article:

  1. 2012/01/17David say:

    Great post! Thanks. Will try this out as we’re in a similar situation. Exported svn-repo and a bunch of big binaries…

  2. 2012/02/8Zach Harkey say:

    Great article. Worked perfectly.

    Incidently, if you accidentally mess something up you can still use git reflog to restore your branch to its state before your filter-branch:

    git reset head@{1}

  3. 2012/02/8tednaleid say:

    @Zach Harkey – that’s a good point about using `git reset HEAD@{1}` to get back to where HEAD was pointing before the `filter-branch` command.

    Just a warning to other readers about this is that the reflog is repo specific, and the clone of you repo with the commands above will not have the old information (which is why we’re doing it :). So you’d want to make sure to keep your previous repo until you’re sure you’re happy with the result in the new repo.

Write a comment: