Sunday, June 12, 2011

Converting to Bazaar from CVS Take 2 (with some EC2 help)

After getting cvsps-import to work, I decided to take another look at converting CVS to bzr. Several bzr experts advised me that fast-import is really what I want to be looking at when dealing with a large repository.

Step 1 is to set up a large EC2 instance using the Ubuntu Natty Narwhal image kindly supplied by Ubuntu.

I recreated my CVS server's directory structure on the EC2 server and uploaded my CVS repository. One note: I needed the CVSROOT directory as well as the directory of the module I was converting. I now have a nice beefy Ubuntu server that looks (to fast-export-from-cvs) like my CVS server. You don't need to get it working as an actual CVS server, because the conversion tools read the repository files directly.
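A quick sanity check along these lines can confirm the uploaded copy has the pieces the conversion tools read. This is only a sketch: the function name `check_cvs_layout` and the idea of looking for RCS `,v` files are my own illustration, not part of any of the tools.

```python
import os


def check_cvs_layout(cvsroot, module):
    """Return a list of problems; an empty list means the layout looks usable.

    Hypothetical helper: it only checks that the CVSROOT directory and the
    module directory exist and that the module holds at least one ',v' file.
    """
    problems = []
    if not os.path.isdir(os.path.join(cvsroot, "CVSROOT")):
        problems.append("missing CVSROOT directory")

    module_dir = os.path.join(cvsroot, module)
    if not os.path.isdir(module_dir):
        problems.append("missing module directory")
    else:
        # CVS stores each file's history in an RCS file ending in ',v'.
        has_rcs = any(
            name.endswith(",v")
            for _, _, files in os.walk(module_dir)
            for name in files
        )
        if not has_rcs:
            problems.append("no RCS ',v' files under module")
    return problems
```

For the layout used in this post, the call would be `check_cvs_layout("/mycvsroot", "module")`.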

Next, install the software needed for the conversion:
sudo apt-get install bzr
mkdir -p ~/.bazaar/plugins/
cd ~/.bazaar/plugins/
bzr branch lp:bzr-fastimport fastimport
bzr branch lp:python-fastimport
cd python-fastimport/
sudo python setup.py install
sudo apt-get install cvs2svn
sudo apt-get install cvs
And now I was ready to give it a try:
bzr fast-export-from-cvs /mycvsroot/module ~/
I am back to my old friend:
ERROR: Git output requires a default commit username
It turns out that the bzr fast-export-from-cvs is basically just a wrapper for the cvs2bzr program. You can hack around the code to get it to work with a properties file if you wish, but you may also call cvs2bzr directly and then work with the generated "fi" file.

I modified the cvs2bzr-example.options file that ships with the cvs2bzr source distribution (the cvs2bzr documentation explains command-line options versus options-file options). I set fallback_encoding to ascii on lines 164 and 172, set the source directory to my CVS repository path including the module name on line 524, and finally set up the branches I wanted converted.

I did not need most of the branches or tags that existed in the repository I was converting. Converting them all added many hours (something I can do overnight vs something I need a weekend for) and grew the export from 6 gigs to 74 gigs. To get cvs2bzr to import only certain branches you have to work backwards, so to speak. There is one rule for forcing a symbol to be a branch and another for excluding symbols, so you force what you want and then exclude everything else. That becomes the following two lines in the options file, where "BRANCH_TO_INCLUDE" is the name of the branch you want to include:
ForceBranchRegexpStrategyRule(r'BRANCH_TO_INCLUDE.*'),
ExcludeRegexpStrategyRule(r'.*'),
This skips the tags as well. If you keep tags, you will need to deal with tags that exist on different branches; I didn't need or want them.

You can also use the ctx.trunk_only option to only import the trunk if this suits your particular situation.
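To see why the force rule and the exclude-everything rule can coexist, here is a small simulation of the selection logic. It assumes, as I understand cvs2svn's behaviour but the post itself does not confirm, that rules are tried in order, that the first matching rule decides, and that a pattern must match the whole symbol name. The symbol names other than BRANCH_TO_INCLUDE are made up.

```python
import re

# Stand-ins for the two rules in the options file, in the order they appear.
# The action labels are illustrative, not cvs2svn identifiers.
RULES = [
    ("force_branch", r"BRANCH_TO_INCLUDE.*"),
    ("exclude", r".*"),
]


def decide(symbol_name, rules=RULES):
    """Return the action of the first rule whose pattern matches the whole name."""
    for action, pattern in rules:
        if re.fullmatch(pattern, symbol_name):
            return action
    return None


# Hypothetical branch and tag names found in a repository.
symbols = ["BRANCH_TO_INCLUDE", "BRANCH_TO_INCLUDE_2", "old_release_tag", "experimental"]
decisions = {name: decide(name) for name in symbols}

for name, action in decisions.items():
    print(name, "->", action)
```

Under those assumptions the order matters: with the exclude rule listed first, every symbol would be excluded before the force rule was ever consulted.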

Here is the full diff of cvs2bzr-example.options (scrubbed a little):
< fallback_encoding='ascii'
---
> #fallback_encoding='ascii'
< fallback_encoding='ascii'
---
> #fallback_encoding='ascii'
< ForceBranchRegexpStrategyRule(r'BRANCH_TO_INCLUDE.*'),
---
> #ForceBranchRegexpStrategyRule(r'branch.*'),
< ExcludeRegexpStrategyRule(r'.*'),
---
> #ExcludeRegexpStrategyRule(r'unknown-.*'),
< '/mnt/',
---
> 'cvs2svn-tmp/',
< r'/mycvsroot/module',
---
> r'test-data/main-cvsrepos',

Now run the export and the import. The export is written to a file under /mnt/ (the location I set in the options file above), and a separate step then imports it into bzr.
cvs2bzr --options=cvs2bzr-example.options
cd /mnt
mkdir module
cd module
bzr init-repo .
cat ../ | bzr fast-import -
You now have a new bzr shared repo in the module folder.
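For running this unattended, the same steps could be wrapped in a small script. The sketch below is an assumption-laden outline rather than what I actually ran: `run_conversion`, its parameters, and the injectable `runner` are invented for illustration, and `dump_file` must be whatever output path your options file writes to.

```python
import os
import subprocess


def run_conversion(options_file, dump_file, repo_dir, runner=subprocess.run):
    """Run the export, create the shared repo, and feed the dump to fast-import.

    Hypothetical wrapper around the commands shown above. `runner` defaults to
    subprocess.run and can be swapped out for a dry run.
    """
    # Export from CVS; the output location comes from the options file.
    runner(["cvs2bzr", "--options=" + options_file], check=True)

    # Create the shared repository directory, as with mkdir + bzr init-repo.
    os.makedirs(repo_dir, exist_ok=True)
    runner(["bzr", "init-repo", "."], cwd=repo_dir, check=True)

    # Equivalent of piping the dump into `bzr fast-import -`.
    with open(dump_file, "rb") as dump:
        runner(["bzr", "fast-import", "-"], cwd=repo_dir, stdin=dump, check=True)
```

Passing `check=True` makes a failed step raise an exception instead of letting the later steps run against a partial export.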

On a large Amazon EC2 instance the entire conversion of my repository took about 3 hours. I highly recommend starting from a fresh EC2 instance running a current version of Ubuntu, and therefore current versions of Python and bzr. These conversion tools are hit or miss depending on your specific version of bzr, and I had a far better experience running on the latest 11.04 (Natty Narwhal). Furthermore, fast-import is much more robust than the cvsps-import module, but it is also more complicated. If you have a smaller repository, try cvsps-import first; if that does not work, or if it is taking too long, punt early and move to the industrial-grade cvs2svn/cvs2bzr solution. My next step is to set this up to run automatically every night with a bootstrapped Puppet setup!