I am creating a small project to implement a simple version control system for information records, such as what would be present in a Wiki.
One example of such a system that many of us are familiar with is Wikipedia (www.wikipedia.org). The goal of that website is to create a large record of human knowledge based on entries created and modified by the public. Anyone with an account can add or modify records on topics that interest them. The underlying belief is that continuous editing and expansion by the public tends to improve the quality and quantity of entries in the website.
In order to minimize any potential damage from vandalism, Wikipedia uses a version control system that stores every version of every record entered into the system. For my terminology, a “record” represents an entry in Wikipedia (e.g., “United States Congress”) while a “version” of a record represents the value of that record at one time (e.g., an original description of Congress or a revised version of the original description of Congress). Wikipedia’s version control system stores the username of the person making the edit, as well as the time and date of the edit, and an editor’s comment for the edit, so the reasons and virtue of each edit can be assessed. The virtue of an edit is determined by comparing a current (edited) version of a record to the previous version of the same record to highlight the changes made by the editor. If those changes are beneficial, the edit is retained. Otherwise, the edits may be reversed so that an earlier version of the record becomes the current version once again. Quality control is performed by the general public and by people who sign up as moderators on certain topics and are notified when changes are made within those topics.
Several design decisions remain to be made. First, should version control be implemented on a (table) record level or on a field level? In other words, if a table has two fields (e.g., A and B), would a simultaneous change to A and B represent two changes or one? For reasons that will become clear, I will be implementing simultaneous changes to multiple fields as a single edit.
Second, a decision must be made on how to store the versioned records. Several choices are possible, but I selected the simplest approach for my purposes (which may not be the best performing approach). That approach stores all versions of each record in a single table rather than storing the “latest” versions of all records in one table and “older” versions in another table. The reasons for this decision will become clear later in this case study.
Third, I wanted to maintain a record of how later record versions were derived from earlier record versions. For example, for a particular record (e.g., the story about Congress), I wanted to be able to ascertain that the current version might be version “D,” and that the history of versioning for this record was to go from “A” to “B” to “C” and finally to “D.” Alternatively, where versions were reverted, that should also be apparent. For example, another version history could be “A” to “B” to “C” to “A” to “D.” This history represents that edits “B” and “C” were reverted, and version “D” was subsequently created by modifying version “A.” This record can be maintained by storing an index to the preceding version in the editing sequence. Thus, “B” would have a preceding version index for “A.”
Fourth, I wanted to maintain who made the edits, when they made them, and store any comments they offer about the reason for the edits.
Finally, I wanted to maintain a record of whether the edits had been formally reviewed. Therefore, I added a Boolean flag to the record to indicate approval.
Combining all these factors, I created a table with the following fields:
1) record_id (for uniquely identifying a record, in the sense of the word described above (e.g., a description of Congress);
2) version_id (for distinguishing between versions of a record – e.g., a second edit to the entry on Congress);
3) parent_version_id (for identifying which version of the record preceded this version of the record in the editing sequence);
4) approved_flag (for indicating that the edited record has been approved);
5) editor_id (who made the edit to create this version of the record);
6) editor_timedate (when the editor made the edit); and
7) editor_comment (any explanation for the edit provided by the editor).
The fields above are in addition to the content field (a character string). I also added an auto-increment field as the primary key for the table.
In light of the design goals and trade-offs described, any comments or concerns would be appreciated.
thanks
Dave
One example of such a system that many of us are familiar with is Wikipedia (www.wikipedia.org). The goal of that website is to create a large record of human knowledge based on entries created and modified by the public. Anyone with an account can add or modify records on topics that interest them. The underlying belief is that continuous editing and expansion by the public tends to improve the quality and quantity of entries in the website.
In order to minimize any potential damage from vandalism, Wikipedia uses a version control system that stores every version of every record entered into the system. For my terminology, a “record” represents an entry in Wikipedia (e.g., “United States Congress”) while a “version” of a record represents the value of that record at one time (e.g., an original description of Congress or a revised version of the original description of Congress). Wikipedia’s version control system stores the username of the person making the edit, as well as the time and date of the edit, and an editor’s comment for the edit, so the reasons and virtue of each edit can be assessed. The virtue of an edit is determined by comparing a current (edited) version of a record to the previous version of the same record to highlight the changes made by the editor. If those changes are beneficial, the edit is retained. Otherwise, the edits may be reversed so that an earlier version of the record becomes the current version once again. Quality control is performed by the general public and by people who sign up as moderators on certain topics and are notified when changes are made within those topics.
Several design decisions remain to be made. First, should version control be implemented on a (table) record level or on a field level? In other words, if a table has two fields (e.g., A and B), would a simultaneous change to A and B represent two changes or one? For reasons that will become clear, I will be implementing simultaneous changes to multiple fields as a single edit.
Second, a decision must be made on how to store the versioned records. Several choices are possible, but I selected the simplest approach for my purposes (which may not be the best performing approach). That approach stores all versions of each record in a single table rather than storing the “latest” versions of all records in one table and “older” versions in another table. The reasons for this decision will become clear later in this case study.
Third, I wanted to maintain a record of how later record versions were derived from earlier record versions. For example, for a particular record (e.g., the story about Congress), I wanted to be able to ascertain that the current version might be version “D,” and that the history of versioning for this record was to go from “A” to “B” to “C” and finally to “D.” Alternatively, where versions were reverted, that should also be apparent. For example, another version history could be “A” to “B” to “C” to “A” to “D.” This history represents that edits “B” and “C” were reverted, and version “D” was subsequently created by modifying version “A.” This record can be maintained by storing an index to the preceding version in the editing sequence. Thus, “B” would have a preceding version index for “A.”
Fourth, I wanted to maintain who made the edits, when they made them, and store any comments they offer about the reason for the edits.
Finally, I wanted to maintain a record of whether the edits had been formally reviewed. Therefore, I added a Boolean flag to the record to indicate approval.
Combining all these factors, I created a table with the following fields:
1) record_id (for uniquely identifying a record, in the sense of the word described above (e.g., a description of Congress);
2) version_id (for distinguishing between versions of a record – e.g., a second edit to the entry on Congress);
3) parent_version_id (for identifying which version of the record preceded this version of the record in the editing sequence);
4) approved_flag (for indicating that the edited record has been approved);
5) editor_id (who made the edit to create this version of the record);
6) editor_timedate (when the editor made the edit); and
7) editor_comment (any explanation for the edit provided by the editor).
The fields above are in addition to the content field (a character string). I also added an auto-increment field as the primary key for the table.
In light of the design goals and trade-offs described, any comments or concerns would be appreciated.
thanks
Dave
Comment