
A proposal for an svn filesystem dump/restore format.

Two problems we want to solve
=============================

 1.  When we change our node-id schema, we need to migrate all of our
     data (by dumping and restoring).

 2.  We need a backup format.  Other software tools could read it
     someday.


Design Goals
============

 A.  Written as two new public functions in svn_fs.h.  To be invoked
     by new 'svnadmin' subcommands.

 B.  Format uses only timeless fs concepts.

     The dump format needs to reference concepts that we *know* are
     general enough to never change.  These concepts must exist
     independently of any internal node-id schema, or any DB storage
     backend.  In other words, we're talking about the basic ideas in
     our original "design spec" from May 2000.


Format Semantics
================

Here are the timeless semantics of our fs design -- the things that
would be stored in our dump format.

  - A filesystem is an array of trees.
    Each tree is called a "revision" and has unversioned properties attached.

  - A revision has a tree of "nodes" hanging off of it.
    Actually, the nodes in the filesystem form a DAG.  A revision
    always points to an initial node that represents the 'root' of some tree.
 
  - The majority of a tree's nodes are hard-links (references) to
    nodes that were created in earlier trees.

  - A node contains:

        - versioned text
        - versioned properties
        - predecessor history:  "which node am I a variant of?"
        - copy history:  "which node am I a copy of?"

    The history values can be non-existent (meaning the node is
    completely new), or can have a value of {revision, path}.
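
As a rough illustration of the semantics listed above, here is a
minimal in-memory sketch.  Every class and field name in it is
hypothetical (none of them come from svn_fs.h), and storing directory
entries as a name-to-node mapping is an assumption made only so that
the "hard-link" sharing between trees can be shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# All names here are hypothetical; only the concepts come from the list above.
History = Optional[Tuple[int, str]]    # {revision, path}, or None if absent


@dataclass
class Node:
    text: Optional[bytes] = None       # versioned text (files only)
    props: Dict[str, bytes] = field(default_factory=dict)
    predecessor: History = None        # "which node am I a variant of?"
    copy_source: History = None        # "which node am I a copy of?"
    # Assumption: a directory holds its entries as name -> Node references.
    entries: Dict[str, "Node"] = field(default_factory=dict)


@dataclass
class Revision:
    props: Dict[str, bytes]            # unversioned properties
    root: Node                         # initial node of this tree


# A filesystem is an array of trees: list index == revision number.
fs: List[Revision] = []

foo = Node(text=b"int main(void) { return 0; }\n")
fs.append(Revision({"log": b"initial import"}, Node(entries={"foo.c": foo})))

# Revision 1 adds bar.c.  foo.c is untouched, so the new root simply
# references the node created in revision 0 (a "hard link").
bar = Node(text=b"/* new file */\n")
fs.append(Revision({"log": b"add bar.c"},
                   Node(entries={"foo.c": foo, "bar.c": bar},
                        predecessor=(0, "/"))))
```

Because unchanged nodes are shared rather than copied, the set of all
nodes forms a DAG, not a forest of independent trees.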


------------------------------------------------------------------------
Refinement of proposal #2:  (after discussion with gstein)
==========================================================

Each record starts with RFC822-style headers.  The final header is
'Content-length:', followed by the content, so record boundaries can
be inferred.

The content section has two implicit parts: a property hash, and the
fulltext.  The division between these two sections is implied by the
"PROPS-END\n" tag at the end of the prophash.  In the case of a
directory node or a revision, only the prophash is present.
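
A minimal reader sketch for this layout, assuming a binary stream, a
blank line between the headers and the content, and ": " as the
header separator.  The function name read_record is hypothetical, and
this is only one way a loader could find record boundaries.

```python
import io


def read_record(stream):
    """Return (headers, content) for the next record, or None at end of input."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:
            break                      # end of input
        line = line.rstrip(b"\n")
        if not line:
            if headers:
                break                  # blank line closes the header block
            continue                   # blank padding between records
        name, _, value = line.partition(b": ")
        headers[name.decode("utf-8")] = value.decode("utf-8")
    if not headers:
        return None
    # Content-length, when present, tells us where the record ends.
    length = int(headers.get("Content-length", "0"))
    content = stream.read(length)
    if len(content) != length:
        raise ValueError("truncated record")
    return headers, content


sample = (b"Node-path: bar/baz\n"
          b"Node-kind: dir\n"
          b"Node-action: add\n"
          b"Prop-content-length: 35\n"
          b"Content-length: 35\n"
          b"\n"
          b"K 10\nsvn:ignore\nV 4\nTAGS\nPROPS-END\n"
          b"\n\n")
stream = io.BytesIO(sample)
headers, content = read_record(stream)
end_marker = read_record(stream)       # None: nothing left but padding
```

Headers the reader does not recognize simply end up in the dict and
can be skipped by the caller.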

-----------------------------------------------------------------

SVN DUMPFILE VERSION 1 FORMAT

The format starts with the version number of the dump format
("SVN-fs-dump-format-version: 1\n"), followed by a series of revision
records.  Each revision record starts with information about the
revision, followed by a variable number of node changes for that
revision.  Fields in [brackets] are optional, and unknown headers are
always ignored, for backwards compatibility.

Revision-number: N
Prop-content-length: P
Content-length: L

   ...P bytes of property data.  Properties are stored in the same
   human-readable hashdump format used by working copy property files,
   except that they end with "PROPS-END\n" for better readability.

Node-path: /absolute/path/to/node/in/filesystem
Node-kind: file | dir  (1)
Node-action: change | add | delete | replace
[Node-copyfrom-rev: X]
[Node-copyfrom-path: /path ]
[Text-copy-source-md5: blob] (2)
[Text-content-md5: blob]
[Text-content-length: T]
[Prop-content-length: P]
Content-length: Y (3)

   ... Y bytes of content data, divided into P bytes of "property"
   data and T bytes of "text" data.  The properties come first; their
   total length (including formatting) is Prop-content-length, and is
   included in Content-length.  The "PROPS-END\n" line always
   terminates the property section if there are props.  The remainder
   of the Y bytes (expected to be equal to Text-content-length)
   represents the contents of the node.
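
To make the split concrete, here is a sketch of how a loader might
divide a node's content into properties and text.  The function names
parse_props and split_content are hypothetical, and the hashdump
details ("K <len>" / key / "V <len>" / value, each followed by a
newline) are taken from the property data shown in this document
rather than from any formal grammar.

```python
def parse_props(data):
    """Parse 'K n / key / V n / value' pairs terminated by PROPS-END."""
    props = {}
    pos = 0

    def read_line():
        nonlocal pos
        end = data.index(b"\n", pos)   # ValueError if the data is cut short
        line = data[pos:end]
        pos = end + 1
        return line

    def read_sized(tag, line):
        nonlocal pos
        kind, _, size = line.partition(b" ")
        if kind != tag:
            raise ValueError("expected %r line, got %r" % (tag, line))
        n = int(size)
        value = data[pos:pos + n]
        pos += n + 1                   # skip the value and its newline
        return value

    while True:
        line = read_line()
        if line == b"PROPS-END":
            return props
        key = read_sized(b"K", line)
        value = read_sized(b"V", read_line())
        props[key.decode("utf-8")] = value


def split_content(headers, content):
    """Split Y bytes of content into (props, text) using the P and T headers."""
    p = int(headers.get("Prop-content-length", 0))
    t = int(headers.get("Text-content-length", 0))
    if p + t != len(content):
        raise ValueError("P + T does not match the content size")
    props = parse_props(content[:p]) if p else {}
    text = content[p:] if "Text-content-length" in headers else None
    return props, text


prop_data = (b"K 14\nsvn:executable\nV 2\non\n"
             b"K 12\nsvn:keywords\nV 15\nLastChangedDate\n"
             b"PROPS-END\n")
text_data = b"Here is the text of the newly added 'bop' file.\nWhee.\n"
node_headers = {"Prop-content-length": str(len(prop_data)),
                "Text-content-length": str(len(text_data))}
props, text = split_content(node_headers, prop_data + text_data)
```

Because keys and values carry explicit lengths, a value may itself
contain newlines or even the string "PROPS-END" without confusing the
parser.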


Notes:

(1) If the node represents a deletion, this field is optional.

(2) This is a checksum of the source of the copy.  A loader process
    can use this checksum to determine that the copyfrom path/rev
    already present in a filesystem is really the *correct* one to use.

(3) The Content-length header is technically unnecessary, since the
    information it holds (and more) can be found in the
    Prop-content-length and Text-content-length fields.  Though
    Subversion itself does not make use of the header when reading a
    dumpfile, we include it for compatibility with generic RFC822
    parsers.
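
Notes (2) and (3) could be checked by a loader roughly as follows.
This is a sketch under assumptions: the checksum "blob" is taken to
be a hex MD5 digest (the format description above does not say), the
function names are hypothetical, and the copyfrom values in the
sample headers are made up for illustration.

```python
import hashlib


def expected_content_length(headers):
    """Content-length as derived from the two more specific headers."""
    return (int(headers.get("Prop-content-length", 0))
            + int(headers.get("Text-content-length", 0)))


def copy_source_matches(headers, existing_text):
    """True if the text already stored at the copyfrom path/rev has the
    checksum named by Text-copy-source-md5 (or if no checksum was given)."""
    wanted = headers.get("Text-copy-source-md5")
    if wanted is None:
        return True
    # Assumption: the blob is a hex digest; compare case-insensitively.
    return hashlib.md5(existing_text).hexdigest() == wanted.lower()


source_text = b"Whee.\n"               # what the target fs holds at copyfrom
sample_headers = {
    "Node-copyfrom-rev": "1421",       # hypothetical values
    "Node-copyfrom-path": "bar/old",
    "Text-copy-source-md5": hashlib.md5(source_text).hexdigest(),
    "Prop-content-length": "76",
    "Text-content-length": "54",
    "Content-length": "130",
}
lengths_agree = (expected_content_length(sample_headers)
                 == int(sample_headers["Content-length"]))
source_ok = copy_source_matches(sample_headers, source_text)
```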

-----------------------------------------------------------------
EXAMPLE

Here's an example of revision 1422, in which I added a new directory
"baz", added a new file "bop" inside it, and modified the file "foo.c":


Revision-number: 1422
Prop-content-length: 80
Content-length: 80

K 6
author
V 7
sussman
K 3
log
V 33
Added two files, changed a third.
PROPS-END

Node-path: bar/baz
Node-kind: dir
Node-action: add
Prop-content-length: 35
Content-length: 35

K 10
svn:ignore
V 4
TAGS
PROPS-END


Node-path: bar/baz/bop
Node-kind: file
Node-action: add
Prop-content-length: 76
Text-content-length: 54
Content-length: 130

K 14
svn:executable
V 2
on
K 12
svn:keywords
V 15
LastChangedDate
PROPS-END
Here is the text of the newly added 'bop' file.
Whee.

Node-path: bar/foo.c
Node-kind: file
Node-action: change
Text-content-length: 102
Content-length: 102

Here is the fulltext of my change to an existing /bar/foo.c.
Notice that this file has no properties.
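
For the dumping side, the length headers in a record like the ones
above can be computed mechanically.  The sketch below is only an
illustration: dump_props and dump_node are hypothetical names, the
header order follows the format listing, and ending each record with
one blank line is an assumption.

```python
def dump_props(props):
    """Serialize a {str: bytes} mapping in the K/V hashdump form."""
    out = bytearray()
    for key, value in props.items():
        k = key.encode("utf-8")
        out += b"K %d\n%s\nV %d\n%s\n" % (len(k), k, len(value), value)
    out += b"PROPS-END\n"
    return bytes(out)


def dump_node(path, kind, action, props=None, text=None):
    """Build one node record; lengths are byte counts, not character counts."""
    prop_data = dump_props(props) if props else b""
    headers = [("Node-path", path), ("Node-kind", kind),
               ("Node-action", action)]
    if props:
        headers.append(("Prop-content-length", len(prop_data)))
    if text is not None:
        headers.append(("Text-content-length", len(text)))
    content = prop_data + (text or b"")
    headers.append(("Content-length", len(content)))
    head = "".join("%s: %s\n" % h for h in headers).encode("utf-8")
    return head + b"\n" + content + b"\n"


record = dump_node(
    "bar/baz/bop", "file", "add",
    props={"svn:executable": b"on", "svn:keywords": b"LastChangedDate"},
    text=b"Here is the text of the newly added 'bop' file.\nWhee.\n")
```

Run on the 'bop' data, this yields the same three length headers as
the example record: 76, 54 and 130.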

-------------------------------------

SVN DUMPFILE VERSION 2 FORMAT

This format is identical to the VERSION 1 format in every respect,
except for the following:

1.) The format starts with the new version number of the dump format
    ("SVN-fs-dump-format-version: 2\n").

2.) In addition to "Revision Records", another sort of record is supported:
    the "UUID" record, which should be of the form:

UUID: 7bf7a5ef-cabf-0310-b7d4-93df341afa7e

    This should be used to indicate the UUID of the originating repository.
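
A loader that accepts both versions might handle the start of the
file along these lines.  This is a sketch: read_preamble is a
hypothetical name, the stream is assumed to be seekable, and the UUID
record is assumed to come before the first revision record (the text
above does not say where it may appear).

```python
import io
import uuid


def read_preamble(stream):
    """Return (format_version, repository_uuid_or_None)."""
    first = stream.readline().rstrip(b"\n")
    name, _, value = first.partition(b": ")
    if name != b"SVN-fs-dump-format-version":
        raise ValueError("not an svn dumpfile")
    version = int(value)

    # Skip blank lines, remembering where the next record starts.
    while True:
        mark = stream.tell()
        line = stream.readline()
        if not line or line.strip():
            break                      # end of input, or first non-blank line

    repo_uuid = None
    if version >= 2 and line.startswith(b"UUID: "):
        repo_uuid = uuid.UUID(line[len(b"UUID: "):].strip().decode("ascii"))
    else:
        stream.seek(mark)              # not a UUID record; leave it unread
    return version, repo_uuid


v2 = io.BytesIO(b"SVN-fs-dump-format-version: 2\n\n"
                b"UUID: 7bf7a5ef-cabf-0310-b7d4-93df341afa7e\n\n"
                b"Revision-number: 0\n")
version, repo_uuid = read_preamble(v2)
```

A version 1 file has no UUID record, so the function returns None for
it and leaves the stream positioned at the first revision record.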


