Karl and I have (mostly) worked out a redesign of cvs2svn that is a bit more involved than just retooling the current implementation of the SymbolicNameTracker, and we think that it's a *huge* win over the current method of calculating tags and branches, even though it requires a bit more on-disk scratch area than the current implementation. *Please* take the time to review this document and point out any problems or inconsistencies that you might find. To reiterate our problem: We need to drastically improve the speed and disk-space requirements of converting a CVS repository with numerous tags and branches. In brief, we propose to: - Gather CVS revisions into Subversion revisions (a Subversion revision is comprised of one or more CVS revisions) in a separate pass before we actually begin committing (where "committing" means either to a Subversion repository or to a dump file). - Preprocess branch and tag data during this pass, using a sort trick to keep relevant data together. This avoids the problem with the current SymbolicNameTracker -- namely the fact that it has to handle a *huge* amount of (mostly) duplicated data all in one great big database file. The Details: ------------ We'll create several new passes and intermediate files as well as completely eliminate the cvs2svn-sym-names.db file as it now stands. First, we create a new pass, let's call it 3.1 for the moment. This pass will iterate through the s-revs file, doing several unrelated things: 1. Determine the closing Subversion revisions for each tag and branch. Yes, this is basically get_symbol_closing_revs(). We write these out to a database in the following format: cvs-symname-closing-revs.db: Key Value CVS Revision array of Symbolic names For example: 1.38/foo/bar/baz.txt,v --> [TAG11, BRANCH38] 1.93/foo/qux/bat.c,v --> [TAG39] 1.4/foo/bar/baz.txt,v --> [BRANCH48, BRANCH37] 1.18/foo/bar/quux.txt,v --> [TAG320, TAG1178] 2. Create a database that maps CVSRevision.unique_key() to the actual CVS Revision data. This gives us the ability to pass around these smaller keys instead of whole CVS revisions (which look like lines from the s-revs file). See the CVSRevision class for more details on what the unique key is. Next, we create a new pass, let's call it 3.3. This pass will iterate through the s-revs file, doing several other unrelated things: 1. Begin creating the RepositoryMirror here--but only the trunk of the repository. We will fill branches and tags into the RepositoryMirror in pass 3.7 (described below). Also, we will no longer record information about which leaf nodes in the RepositoryMirror are on which branch or tag (currently, we store the name of the tag/branch in a dictionary in the leaf node, and that's not only pricey in terms of reading, writing, and marshalling to and from the database, but it makes the nodes.db HUGE). NOTE: Can we do this in pass3.1? 2. Generate Subversion commit groups. As we iterate through the s-revs file, we accumulate CVS Revisions into commit groups. We write these into two tables: svn-nums-to-cvs-revs.db: Key Value Subversion revnum array of CVS Revisions cvs-revs-to-svn-nums.db: Key Value CVS Revision Subversion revnum This gives us a bi-directional mapping between Subversion and CVS Revision numbers. We'll use the same algorithms that we currently use in pass4 to determine which Subversion revisions will accommodate branch and tag creation and any other "special" revisions. One implementation detail is that we'll use the CVSRevision unique_key() to reference CVS Revisions instead of passing the whole revision around. 3. For each CVS Revision in s-revs, we write out a line to a symbolic-names.txt file if it is the first possible source revision (the "opening" revision) for a copy to create a branch or tag, or if it is the last possible revision (the "closing" revision) for a copy to create a branch or tag. The format of each line is: SYMBOLIC_NAME CVSRevision.unique_key() For example: MY_TAG1 1.3/foo/bar/baz.txt,v MY_BRANCH3 1.13/foo/qux/bat.c,v MY_TAG1 1.4/foo/bar/baz.txt,v MY_BRANCH_BLAH 1.1/foo/bar/quux.txt,v Here is what the columns mean: SYMBOLIC_NAME: The name of the branch or tag that starts in this CVS Revision (There can be multiples per CVS rev) CVSRevision.unique_key(): This is a unique key that identifies the CVSRevision where this opening or closing happened. Determining opening and closing Subversion revisions: Since we're assembling the mapping of CVS Revisions to Subversion revisions (above in step 2), for now, we'll merely refer to the CVS Revision, and later use the mapping file to find the corresponding Subversion revision for each opening and closing. Finding the opening revision for a source path on a branch or tag: The line in the s-revs file that lists the corresponding tag(s) and branch(es) corresponds to the opening CVS Revision for that file. Finding the closing revision for a source file on a branch or tag: This is a little more complex. In order to find the closing rev of a source path, we need to record the Subversion[1] path of the opening, and the symbolic name that it opens. Now given that we can have multiple symbolic names opening in a single CVS Revision, we'll have to map each Subversion path to multiple symbolic names (similar to cvs-symname-closing-revs.db): svn-opening-names.db: Key Value Subversion path array of Symbolic names For example: trunk/foo/bar/bat.c --> [TAG11, BRANCH38] trunk/qux/quux.txt --> [TAG39] branches/mybranch1/a.c --> [BRANCH48, BRANCH37] [1] We need to use the Subversion path here to account for branches and tags of files that are already on a branch. For each CVS Revision, we calculate the Subversion path, and check for the existence of that path in svn-opening-names.db, and if it's in there, then we check to see if it's value contains the symbolic name that we're currently working on. If it does, then we remove that symbolic name from the list of values and record our current CVS Revision in symbolic-names.txt as the closing revision. Now for another pass that we will perform before pass 4. I'll call this one 3.5. Here we sort the symbolic-names.txt file into symbolic-names-s.txt using GNU sort. :-) .-----------------------------------------------------------------------. | Side discussion: Dealing With Large Files: | | | | Both symbolic-names.txt and symbolic-names-s.txt are large files, | | approximately | | | | NUM_SYMBOLIC_NAMES * NUM_PATHS * AVG_LINE_LENGTH | | (AVG_LINE_LENGTH = AVG_RCS_PATH_LENGTH + AVG_SYMBOLIC_NAME_LENGTH) | | | | In a repository with 30,000 files and 3000 branches or tags, that's | | be approximately 90,000,000 lines. Let's be pessimistic and say | | the average line length is 50 bytes, then each file will be | | 4,500,000,000 bytes (appx 4.5 Gigs). Clearly, this is going to | | hurt! There are various strategies for dealing with this. We can | | split the data into multiple files, sort the individual files, | | then in the steps below, where we talk about reading just one | | file, we'd actually read N files. | | | | ### Karl, I think that your estimate of 50 bytes is rather | I hate these freaking | ### conservative. Also, GNU Sort provides a merge command for | | ### merging a number of pre-sorted files. See +----------------. | ### http://www.gnu.org/software/textutils/manual/textutils/html_node/textutils_23.html | | ### for more info. +----------------' | | | Another approach would be to use surrogate keys for symbolic | text boxes. | names as well as cvs revisions and provide a mapping table for | | both. For example, we might have 2 database tables with schemas | | like this: | | | | KEY CVS Revision | | 143a 1.4/foo/bar/baz.txt,v | | 1d4c 1.9/foo/bat.c,v | | | | KEY Symbolic Name | | 1f MY_REALLY_LONG_TAG_NAME | | 8 SOME_BRANCH_NAME_HERE | | | | That would allow us to use the shorter keys in place of the | | actual values of the CVS Revision or Symbolic Name in question. | | And since we're only talking about 30,000 files and 3,000 tags in | | our examples, these values would remain very small (under 16 | | bytes max). So, with 90,000,000 lines of 32 bytes each, we're | | looking at about 2.8GB--still a lot of data. | | | | Lastly, remember that we're discussing the pathological case | | here--assuming that *every* file in the repository is on every | | branch and every tag. | | | | For now, we'll talk as though there's only one symbolic-names.txt | | file, because things are easier to explain that way. | `-----------------------------------------------------------------------' The sorted file has two useful properties: - All the lines for a given symbolic name are together. - Since the closing revision is always >= the opening revision (for a given path and symbolic name), closing revs always come after opening revs. And now for Yet Another Pass that we will perform before pass 4. I'll call this one 3.7. Because symbolic-names-s.txt groups each symbolic name together, we can read all the paths for one symbolic name into memory easily. So, we read through symbolic-names-s.txt and accumulate *all* the paths for a given symbolic name into an in-memory data structure, replacing the opening and closing CVS revs with Subversion revs as per the lookup in cvs-revs-to-svn-nums.db. The in-memory data structure is basically the same as the tree currently stored in cvs2svn-sym-names.db under that symbolic name -- only now it won't be in a database anymore. We can do this, because any single tree fits into memory, we just need to avoid having them *all* in memory at once. We now score this entire in-memory tree, using the same algorithm as we currently use for the on-disk database. This will go much faster, because we'll be working in memory. The output of our tree scoring will be a set of instructions in a small "copying language", telling us how to create that symbolic name in Subversion. For example, To create branch NAME: copy 7:/trunk/foo /branches/NAME/foo del /branches/NAME/foo/bar/baz.c del /branches/NAME/foo/qux copy 19:/trunk/foo/qux /branches/NAME/foo/qux etc, etc. These sets of instructions are written to disk, one file per symbolic name. They replace the extremely disk-hungry and disk-intensive cvs2svn-sym-names.db, by expressing only the information needed to create that symbolic name (i.e., no more node duplication). Then, analogous to Subversion's delta-editor, we will have a "copy editor" that can parse such files, and run the appropriate series of operations to create the branch or tag. When it's time to create a tag or branch, we just invoke the copy editor on the appropriate file: 'copy_editor.run(cmdfile, dumper)' or something like that. Changes to pass4 The 4th pass will now have very little "thinking" to do--it's basically going to open the svn-nums-to-cvs-revs.db and, starting with Subversion revision 1, and sequentially play out all the commits to either the Subversion repository or to a dumpfile. After each commit, the "copy editor" will be passed the current Subversion revision number, and it will perform any copy commands that need to be done (as described in pass 3.7).