Karl and I have (mostly) worked out a redesign of cvs2svn that is a
bit more involved than just retooling the current implementation of
the SymbolicNameTracker, and we think that it's a *huge* win over the
current method of calculating tags and branches, even though it
requires a bit more on-disk scratch area than the current
implementation.

*Please* take the time to review this document and point out any
 problems or inconsistencies that you might find.


To reiterate our problem: We need to drastically improve the speed and
disk-space requirements of converting a CVS repository with numerous
tags and branches.

In brief, we propose to:

   - Gather CVS revisions into Subversion revisions (a Subversion
     revision is comprised of one or more CVS revisions) in a separate
     pass before we actually begin committing (where "committing"
     means either to a Subversion repository or to a dump file).

   - Preprocess branch and tag data during this pass, using a sort
     trick to keep relevant data together.  This avoids the problem
     with the current SymbolicNameTracker -- namely the fact that it
     has to handle a *huge* amount of (mostly) duplicated data all in
     one great big database file.

The Details:
------------

We'll create several new passes and intermediate files as well as
completely eliminate the cvs2svn-sym-names.db file as it now stands.

First, we create a new pass, let's call it 3.1 for the moment.  This
pass will iterate through the s-revs file, doing several unrelated
things:

  1. Determine the closing Subversion revisions for each tag and
     branch.  Yes, this is basically get_symbol_closing_revs().  We
     write these out to a database in the following format:

       cvs-symname-closing-revs.db:
            Key                      Value    
            CVS Revision             array of Symbolic names
     
       For example:
 
            1.38/foo/bar/baz.txt,v  --> [TAG11, BRANCH38]
            1.93/foo/qux/bat.c,v    --> [TAG39]
            1.4/foo/bar/baz.txt,v   --> [BRANCH48, BRANCH37]
            1.18/foo/bar/quux.txt,v --> [TAG320, TAG1178]

  2. Create a database that maps CVSRevision.unique_key() to the
     actual CVS Revision data.  This gives us the ability to pass
     around these smaller keys instead of whole CVS revisions (which
     look like lines from the s-revs file).  See the CVSRevision class
     for more details on what the unique key is.


Next, we create a new pass, let's call it 3.3.  This pass will iterate
through the s-revs file, doing several other unrelated things:

  1. Begin creating the RepositoryMirror here--but only the trunk of
     the repository.  We will fill branches and tags into the
     RepositoryMirror in pass 3.7 (described below).  Also, we will no
     longer record information about which leaf nodes in the
     RepositoryMirror are on which branch or tag (currently, we store
     the name of the tag/branch in a dictionary in the leaf node, and
     that's not only pricey in terms of reading, writing, and
     marshalling to and from the database, but it makes the nodes.db
     HUGE). NOTE: Can we do this in pass3.1?

  2. Generate Subversion commit groups.  As we iterate through the
     s-revs file, we accumulate CVS Revisions into commit groups.  We
     write these into two tables: 

       svn-nums-to-cvs-revs.db:
            Key                      Value    
            Subversion revnum        array of CVS Revisions

       cvs-revs-to-svn-nums.db:
            Key                      Value    
            CVS Revision             Subversion revnum

     This gives us a bi-directional mapping between Subversion and CVS
     Revision numbers.  We'll use the same algorithms that we
     currently use in pass4 to determine which Subversion revisions
     will accommodate branch and tag creation and any other "special"
     revisions.  One implementation detail is that we'll use the
     CVSRevision unique_key() to reference CVS Revisions instead of
     passing the whole revision around.  

  3. For each CVS Revision in s-revs, we write out a line to a
     symbolic-names.txt file if it is the first possible source
     revision (the "opening" revision) for a copy to create a branch
     or tag, or if it is the last possible revision (the "closing"
     revision) for a copy to create a branch or tag.

     The format of each line is:

         SYMBOLIC_NAME CVSRevision.unique_key()
         
     For example:
 
         MY_TAG1 1.3/foo/bar/baz.txt,v
         MY_BRANCH3 1.13/foo/qux/bat.c,v
         MY_TAG1 1.4/foo/bar/baz.txt,v
         MY_BRANCH_BLAH 1.1/foo/bar/quux.txt,v

     Here is what the columns mean:

     SYMBOLIC_NAME: The name of the branch or tag that starts in this
                    CVS Revision (There can be multiples per CVS rev)

     CVSRevision.unique_key(): This is a unique key that identifies
                               the CVSRevision where this opening or
                               closing happened.
                               
     Determining opening and closing Subversion revisions:
     
       Since we're assembling the mapping of CVS Revisions to
       Subversion revisions (above in step 2), for now, we'll merely
       refer to the CVS Revision, and later use the mapping file to
       find the corresponding Subversion revision for each opening and
       closing.

       Finding the opening revision for a source path on a branch or
       tag:

         The line in the s-revs file that lists the corresponding
         tag(s) and branch(es) corresponds to the opening CVS Revision
         for that file.

       Finding the closing revision for a source file on a branch or
       tag:

         This is a little more complex.  In order to find the closing
         rev of a source path, we need to record the Subversion[1]
         path of the opening, and the symbolic name that it opens.
         Now given that we can have multiple symbolic names opening in
         a single CVS Revision, we'll have to map each Subversion
         path to multiple symbolic names (similar to
         cvs-symname-closing-revs.db):

           svn-opening-names.db:
                Key                      Value    
                Subversion path          array of Symbolic names
         
           For example:
     
                trunk/foo/bar/bat.c     --> [TAG11, BRANCH38]
                trunk/qux/quux.txt      --> [TAG39]
                branches/mybranch1/a.c  --> [BRANCH48, BRANCH37]

           [1] We need to use the Subversion path here to account for
               branches and tags of files that are already on a
               branch.

         For each CVS Revision, we calculate the Subversion path, and
         check for the existence of that path in svn-opening-names.db,
         and if it's in there, then we check to see if it's value
         contains the symbolic name that we're currently working on.
         If it does, then we remove that symbolic name from the list
         of values and record our current CVS Revision in
         symbolic-names.txt as the closing revision.
         

Now for another pass that we will perform before pass 4.  I'll call
this one 3.5.  Here we sort the symbolic-names.txt file into
symbolic-names-s.txt using GNU sort. :-)

   .-----------------------------------------------------------------------.
   | Side discussion: Dealing With Large Files:                            |
   |                                                                       |
   | Both symbolic-names.txt and symbolic-names-s.txt are large files,     |
   | approximately                                                         |
   |                                                                       |
   |    NUM_SYMBOLIC_NAMES * NUM_PATHS * AVG_LINE_LENGTH                   |
   |    (AVG_LINE_LENGTH = AVG_RCS_PATH_LENGTH + AVG_SYMBOLIC_NAME_LENGTH) |
   |                                                                       |
   | In a repository with 30,000 files and 3000 branches or tags, that's   |
   | be approximately 90,000,000 lines.  Let's be pessimistic and say      |
   | the average line length is 50 bytes, then each file will be           |
   | 4,500,000,000 bytes (appx 4.5 Gigs).  Clearly, this is going to       |
   | hurt!  There are various strategies for dealing with this.  We can    |
   | split the data into multiple files, sort the individual files,        |
   | then in the steps below, where we talk about reading just one         |
   | file, we'd actually read N files.                                     |
   |                                                                       |
   | ### Karl, I think that your estimate of 50 bytes is rather            |  I hate these freaking
   | ### conservative.  Also, GNU Sort provides a merge command for        |
   | ### merging a number of pre-sorted files.  See                        +----------------.
   | ### http://www.gnu.org/software/textutils/manual/textutils/html_node/textutils_23.html |
   | ### for more info.                                                    +----------------'
   |                                                                       |
   | Another approach would be to use surrogate keys for symbolic          |  text boxes.
   | names as well as cvs revisions and provide a mapping table for        |
   | both. For example, we might have 2 database tables with schemas       |
   | like this:                                                            |
   |                                                                       |
   |   KEY    CVS Revision                                                 |
   |   143a   1.4/foo/bar/baz.txt,v                                        |
   |   1d4c   1.9/foo/bat.c,v                                              |
   |                                                                       |
   |   KEY    Symbolic Name                                                |
   |   1f     MY_REALLY_LONG_TAG_NAME                                      |
   |   8      SOME_BRANCH_NAME_HERE                                        |
   |                                                                       |
   | That would allow us to use the shorter keys in place of the           |
   | actual values of the CVS Revision or Symbolic Name in question.       |
   | And since we're only talking about 30,000 files and 3,000 tags in     |
   | our examples, these values would remain very small (under 16          |
   | bytes max).  So, with 90,000,000 lines of 32 bytes each, we're        |
   | looking at about 2.8GB--still a lot of data.                          |
   |                                                                       |
   | Lastly, remember that we're discussing the pathological case          |
   | here--assuming that *every* file in the repository is on every        |
   | branch and every tag.                                                 |
   |                                                                       |
   | For now, we'll talk as though there's only one symbolic-names.txt     |
   | file, because things are easier to explain that way.                  |
   `-----------------------------------------------------------------------'

   The sorted file has two useful properties:

      - All the lines for a given symbolic name are together.

      - Since the closing revision is always >= the opening revision
        (for a given path and symbolic name), closing revs always come
        after opening revs.

And now for Yet Another Pass that we will perform before pass 4.  I'll
call this one 3.7.  Because symbolic-names-s.txt groups each symbolic
name together, we can read all the paths for one symbolic name into
memory easily.  So, we read through symbolic-names-s.txt and
accumulate *all* the paths for a given symbolic name into an in-memory
data structure, replacing the opening and closing CVS revs with
Subversion revs as per the lookup in cvs-revs-to-svn-nums.db.  The
in-memory data structure is basically the same as the tree currently
stored in cvs2svn-sym-names.db under that symbolic name -- only now it
won't be in a database anymore.  We can do this, because any single
tree fits into memory, we just need to avoid having them *all* in
memory at once.

We now score this entire in-memory tree, using the same algorithm as
we currently use for the on-disk database.  This will go much faster,
because we'll be working in memory.

The output of our tree scoring will be a set of instructions in a
small "copying language", telling us how to create that symbolic name
in Subversion.  For example,

    To create branch NAME:
       copy 7:/trunk/foo /branches/NAME/foo
       del /branches/NAME/foo/bar/baz.c
       del /branches/NAME/foo/qux
       copy 19:/trunk/foo/qux /branches/NAME/foo/qux
     
etc, etc.  These sets of instructions are written to disk, one file
per symbolic name.  They replace the extremely disk-hungry and
disk-intensive cvs2svn-sym-names.db, by expressing only the
information needed to create that symbolic name (i.e., no more node
duplication).

Then, analogous to Subversion's delta-editor, we will have a "copy
editor" that can parse such files, and run the appropriate series of
operations to create the branch or tag.  When it's time to create a
tag or branch, we just invoke the copy editor on the appropriate file:
'copy_editor.run(cmdfile, dumper)' or something like that.


Changes to pass4

The 4th pass will now have very little "thinking" to do--it's
basically going to open the svn-nums-to-cvs-revs.db and, starting with
Subversion revision 1, and sequentially play out all the commits to
either the Subversion repository or to a dumpfile.  After each commit,
the "copy editor" will be passed the current Subversion revision
number, and it will perform any copy commands that need to be done (as
described in pass 3.7).