Technical Infrastructure
Free software projects rely on collaboration
technologies: tools that support the selective capture and
integration of digitally-expressed human intentions about a shared
project. The more skilled you are at using these tools, and at
persuading others to use them, the more successful your project will
be.
This only becomes
more true as the project grows. Smart information management is what
prevents open source projects from collapsing under the weight of
Brooks' Law (from his book The Mythical Man-Month, 1975; see
https://en.wikipedia.org/wiki/The_Mythical_Man-Month, https://en.wikipedia.org/wiki/Brooks_Law, and
https://en.wikipedia.org/wiki/Fred_Brooks),
which states that adding more people to a late software project makes it
later. Fred Brooks observed that the complexity of communications in
a project
increases as the square of the number of
participants. When only a few people are involved, everyone can easily
talk to everyone else, but when hundreds of people are involved, it is
no longer possible for each person to remain constantly aware of what
everyone else is doing. If good free software project management is
about making everyone feel like they're all working together in the
same room, the obvious question is: what happens when everyone in a
crowded room tries to talk at once?
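To make the quadratic growth concrete, here is a small sketch. It assumes the usual reading of Brooks' observation, namely that every pair of participants is one potential communication channel, so that n participants have n(n-1)/2 channels. The function name is made up for this illustration.

```python
def communication_channels(participants: int) -> int:
    """Number of distinct pairs among `participants` people: n(n-1)/2."""
    if participants < 0:
        raise ValueError("participants must be non-negative")
    return participants * (participants - 1) // 2


if __name__ == "__main__":
    # Show how quickly the number of possible pairings outgrows the headcount.
    for n in (3, 10, 100, 500):
        print(f"{n:>4} participants -> {communication_channels(n):>7} possible pairs")
```

Three people have three possible pairings, which anyone can keep track of. Five hundred people have well over a hundred thousand, which is why the routing and labeling described in the rest of this chapter has to be done by tools rather than by memory.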
This problem is not new. In real-world crowded rooms, the
solution is parliamentary procedure: formal
guidelines for how to have real-time discussions in large groups, how
to make sure important dissents are not lost in floods of "me-too"
comments, how to form subcommittees, how to recognize and record when
decisions
are made, etc. An important part of parliamentary procedure is
specifying how the group interacts with its information management
system. Some remarks are made "for the record", others are not. The
record itself is subject to direct manipulation, and is understood to
be not a literal transcript of what occurred but rather a representation of
what the group is willing to agree occurred. The
record is not monolithic; it takes different forms for different
purposes. It comprises the minutes of individual meetings, the
complete collection of all minutes of all meetings, summaries, agendas
and their annotations, committee reports, reports from correspondents
not present, lists of action items, etc.
Because the Internet is not really a room, we can dispense with
those parts of parliamentary procedure that
keep some people quiet while others are speaking. But when it comes
to information management techniques, well-run open source projects
are parliamentary procedure on steroids. Since almost all
communication in open source projects happens in writing, elaborate
systems have evolved for routing and labeling data appropriately, for
minimizing repetitions so as to avoid spurious divergences, for
storing and retrieving data, for correcting bad or obsolete
information, and for associating disparate bits of information with
each other as new connections are observed.
Active participants in
open source projects internalize many of these techniques, and will
often perform complex manual tasks to ensure that information is
routed correctly. But the whole endeavor ultimately depends on
sophisticated software support. As much as possible, the
communications media themselves should do the routing, labeling, and
recording, and should make the information available to humans in the
most convenient way possible. In practice, of course, humans will
still need to intervene at many points in the process, and it's
important that the software make such interventions convenient too.
But in general, if the humans take care to label and route information
accurately on its first entry into the system, then the software
should be configured to make as much use of that metadata as
possible.
The advice in this chapter is intensely practical, based on
experiences with specific software and usage patterns. But the point
is not just to teach a particular collection of techniques. It is
also to demonstrate, by means of many small examples, the overall
attitude that will best encourage good information management in your
project. Promoting this attitude will involve a combination of technical skills
and people skills. The technical skills are essential because
information management software always requires configuration, plus a
certain amount of ongoing maintenance and tweaking as new needs arise
(handling project growth is one example). The people skills are necessary
because the human community also requires maintenance: it's not always
immediately obvious how to use these tools to full advantage, and in
some cases projects have conflicting conventions (for example, over
whether to set Reply-to headers on
outgoing mailing list posts).
Everyone involved with the project will need to be encouraged, at the
right times and in the right ways, to do their part to keep the
project's information well organized. The more interested the
contributor, the more complex and specialized the techniques she will
be willing to learn.
The right techniques for your project may change over time, as
collaboration technology changes and as your project changes.
You may finally get everything configured just
the way you want it, and have most of the community participating, but
then project growth will make some of those practices unscalable. Or
project growth may stabilize, and the developer and user communities
settle into a comfortable relationship with the technical
infrastructure, but then someone will come along and invent a whole
new information management service, and pretty soon newcomers will be
asking why your project doesn't use it — for example, this
happened to a lot of free software projects that predate the invention
of the wiki (see https://en.wikipedia.org/wiki/Wiki), and more recently has been
happening to projects whose workflows were developed before the rise
of GitHub pull requests (PRs) as the canonical
way to package proposed contributions. Many infrastructure questions
are matters of judgement, involving tradeoffs between the convenience
of those producing information and the convenience of those consuming
it, or between the time required to configure information management
software and the benefit it brings to the project.
Beware of the temptation to over-automate, that is, to automate
things that really require human attention. Technical infrastructure
is important, but what makes a free software project work is
care — and intelligent expression of that care — by the humans
involved. The technical infrastructure is really about giving humans
easy opportunities to apply care.
What a Project Needs
Most open source projects offer at least this minimum, standard set
of tools for managing information:
Web site
Primarily a centralized, one-way conduit of
information from the project out to the public and to
participants. The web site may also serve as a portal
leading to other project tools.
Message forums / Mailing lists
Usually the most active communications forum in the
project, and the "medium of record."
Version control
Enables developers to manage code changes conveniently,
including reverting and "change porting". Enables
everyone to watch what's happening to the code.
Bug tracking
Enables developers to keep track of what they're working
on, coordinate with each other, and plan releases. Enables
everyone to query the status of bugs and record
information (e.g., reproduction recipes) about particular
bugs. Can be used for tracking not only bugs, but also
tasks, releases, new features, etc.
Real-time chat
A place for quick, lightweight discussions and
question/answer exchanges. Not always archived
completely.
Each tool in this set addresses a distinct need, but their functions
are also interrelated, and the tools must be made to work together.
Below we will examine how they can do so, and more importantly, how to
get people to use them.
You may be able to avoid a lot of the headache of choosing and
configuring many of these tools by using a canned
hosting site: an online service that offers prepackaged,
templatized web services with some or all of the collaboration tools
needed to run a free software project. The advantages and
disadvantages of canned hosting are discussed later in this chapter.
Web Site
For our purposes, the web site means web
pages devoted to helping people participate in the project as
developers, documenters, etc. Note that this may be different from the
main user-facing web site. In many projects, users have different
needs and often (statistically speaking) a different mentality from
the developers. The kinds of web pages most helpful to users are not
always the same as those helpful for developers. Don't try to make a
"one size fits all" web site just to save some writing and maintenance
effort: you'll end up with a site that is not quite right for either
audience.
The two types of sites should cross-link, of course, and in
particular it's important that the user-oriented site have, tucked
away in a corner somewhere, a clear link to the developers' site, since
most new developers will start out at the user-facing pages and look
for a path from there to the developers' area.
An example may make this clearer. As of this writing in
February 2022, the office suite LibreOffice has its main user-oriented
web site at https://www.libreoffice.org/, as you'd expect. If you were a user wanting
to download and install LibreOffice, you'd start there, go straight to
the "Download" link, and so on. But if you were a developer looking
to fix a bug in LibreOffice, you might
start at https://www.libreoffice.org/, but you'd be looking for a link that says
something like "Developers", or "Development", or "Get
Involved" — in other words, you'd be looking for the
gateway to the development area.
LibreOffice, like other large projects, has a few different
gateways to developer-land. There's a prominent link partway down the
page that says "Get Involved", and at the top there's also a dropdown
menu named "Improve It" that offers a number of paths to
participation, including a "Developers" item.
The "Get Involved" page is aimed at the broadest possible range of
potential contributors: developers, yes, but also documenters,
quality-assurance testers, marketing helpers, web infrastructure
experts, financial or in-kind donors, interface designers, support
forum helpers, etc. This frees up the "Developers" page to target
the rather narrower audience of programmers interested
in improving the LibreOffice code. The set of links and short
descriptions provided on both pages is admirably clear and concise:
you can tell immediately from looking whether you're in the right
place for what you want to do, and if so what the next thing to click on
is. The "Development" page gives some information about where to find
the code, how to contact the other developers, how to file bugs, and
things like that, but most importantly it points to what most seasoned
open source contributors would instantly recognize as the
real gateway to actively-maintained development
information: the development wiki at https://wiki.documentfoundation.org/Development.
This division into two contributor-facing gateways, one for all
kinds of contributions and another for coders specifically, is
probably right for a large, multi-faceted project like LibreOffice.
You'll have to use your judgement as to whether that kind of
subdivision is appropriate for your project; at least at the
beginning, it probably isn't. It's better to start with one unified
contributor gateway, aimed at all the types of contributors you
expect, and if that page ever gets large enough or complex enough to
feel unwieldy — listen carefully for complaints about
it, since you and other long-time participants will be naturally
desensitized to weaknesses in introductory
pages! — then you can divide it up however seems
best.
From a technical point of view there is not much to say about
setting up the project web site. Web hosting is easy to come by,
and most of the important things to say about layout
and arrangement were covered in the previous chapter. The web site's
main function is to present a clear and welcoming overview of the
project, and to bind together the various collaboration tools (the
version control system, bug tracker, etc). To save time and effort,
many projects just use one of the canned hosting services, as
described below.
Canned Hosting
A canned hosting site is an online
service that offers some or all of the online collaboration tools
needed to run a free software project. At a minimum, a canned hosting
site offers public version control repositories and bug tracking; most
also offer wiki space, many offer mailing list hosting too, and some
offer continuous integration testing (see automated-testing) and other
services. For many projects, canned hosting provides a
perfectly adequate developer-oriented entry point to the project, and
there is no need to set up a separate web site.
Note that even when a canned hosting site
doesn't offer message forums as a standalone feature, it will usually
offer rich notification and subscription/watch features attached to
its bug tracker and version control system, such that participants can
effectively have a message-forum-style discussion centered around a
particular bug or change. While these features are very useful, they
are not a full substitute for first-class message forums.
Note also that for successful free software
projects, interested commercial entities will eventually often step up
to fund many of these services anyway.
There are two main advantages to using a canned site. The first
is server maintenance: uptime monitoring, operating system upgrades,
etc. Having someone else handle that is one less thing to worry
about. The second advantage is simplicity. They have already chosen
a bug tracker, a version control system, perhaps discussion forum
software, and everything else you need to run a project. They've
configured the tools, arranged single-sign-on authentication where
appropriate, are taking care of backups for all the data stored in the
tools, etc. You don't need to make many decisions. All you have to
do is fill in a registration form, press a button, and suddenly you've
got a project development web site.
These are pretty significant benefits. The disadvantage, of
course, is that you must accept their choices and
configurations, even if something different would be better for your
project. Usually canned sites are adjustable within certain narrow
parameters, but you will never get the fine-grained control you would
have if you set up the site yourself and had full administrative
access to the server.
A perfect example of this is the handling of generated files.
Certain project web pages may be generated files — for example,
there are systems for keeping FAQ data in an easy-to-edit master
format, from which HTML, PDF, and other presentation formats can be
generated. You wouldn't want to version the generated formats, only the master
file. But when your web site is hosted on someone else's server, it
may be difficult to set up a custom hook to regenerate the online
HTML version of the FAQ whenever the master file is changed.
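Where you do control the server, or where the hosting site lets you run a script after each change, the regeneration step itself can be quite small. The following is only a sketch of the idea: it regenerates an output file when the master file is newer. The function names are invented for this example, and `render_faq_as_html` is a deliberately trivial stand-in for whatever real FAQ-formatting tool a project uses.

```python
import html
from pathlib import Path
from typing import Callable


def needs_regeneration(master: Path, generated: Path) -> bool:
    """True if the generated file is missing or older than the master file."""
    if not generated.exists():
        return True
    return generated.stat().st_mtime < master.stat().st_mtime


def regenerate_if_stale(master: Path, generated: Path,
                        render: Callable[[str], str]) -> bool:
    """Rewrite `generated` from `master` when stale; return True if rewritten."""
    if not needs_regeneration(master, generated):
        return False
    source = master.read_text(encoding="utf-8")
    generated.write_text(render(source), encoding="utf-8")
    return True


def render_faq_as_html(source: str) -> str:
    """Hypothetical renderer: one escaped <p> per blank-line-separated block."""
    blocks = [block.strip() for block in source.split("\n\n") if block.strip()]
    return "\n".join(f"<p>{html.escape(block)}</p>" for block in blocks)
```

A version control hook or a scheduled job would call `regenerate_if_stale` with the master file's path and the path of the published HTML. Only the master file is versioned; the output is always rebuilt from it.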
If you choose a canned site, try to leave open the option of
switching to a different site later, by using a custom domain name as the
project's development home address. You can forward that URL to the
canned site, or have a fully customized development home page at the
main URL and link to the canned site for specific functionality. Just
try to arrange things such that if you later decide to use a different
hosting solution, the project's main address doesn't need to
change.
If you're not sure whether to use canned hosting, then you
should probably use canned hosting. These sites have integrated their
services in myriad ways (just one example: if a commit mentions a bug
ticket number using a certain format, then people browsing that commit
later will find that it automatically links to that ticket), ways that
would be laborious for you to reproduce, especially if it's your first
time running an open source project. The universe of possible
configurations of collaboration tools is vast and complex, but the
same set of choices has faced everyone running an open source project
and there are some settled solutions now. Each of the canned hosting
sites implements a reasonable subset of that solution space, and
unless you have reason to believe you can do better, your project will
probably run best by just using one of those sites.
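To give a sense of what one such integration involves, here is a minimal sketch of the commit-to-ticket linking mentioned above. It assumes the "certain format" is a `#` followed by digits and that tickets live at a URL ending in the ticket number. Both are assumptions made for illustration, not a description of how any particular hosting site does it.

```python
import html
import re

# A '#' followed by digits, not glued to a preceding word character or '&'
# (the '&' check avoids touching numeric HTML entities such as '&#39;').
TICKET_REF = re.compile(r"(?<![\w&])#(\d+)\b")


def link_ticket_refs(commit_message: str, tracker_url: str) -> str:
    """Return the commit message as HTML, with ticket references hyperlinked."""
    base = tracker_url.rstrip("/")
    escaped = html.escape(commit_message)

    def to_link(match: re.Match) -> str:
        number = match.group(1)
        return f'<a href="{base}/{number}">#{number}</a>'

    return TICKET_REF.sub(to_link, escaped)
```

This is the easy part. A hosting site also has to do the reverse (show on the ticket page which commits mention it), send notifications, and handle renamed or moved projects, which is where the laboriousness comes in.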
Choosing a Canned Hosting Site
There are now so many sites providing free-of-charge canned
hosting for projects released under open source licenses that there is
not space here to review the field.
So I'll make this easy:
If you don't know what to choose, then choose GitHub (https://github.com/). It's by far the most
popular and appears set to stay that way for some years to come. It
has a good set of features and integrations. Many developers are
already familiar with GitHub and have an account there. It offers
APIs at
https://develop.github.com/ for interacting
programmatically with project resources, and starting in 2020 it
introduced message forums. (The feature's name is "GitHub
Discussions"; you have to turn it on for your repository, as
it's not currently on by default.)
If you're not convinced by GitHub (for example because your
project uses, say, Mercurial instead of Git for version control), but
you aren't sure where to host, take a look at Wikipedia's thorough
comparison at https://en.wikipedia.org/wiki/Comparison_of_open_source_software_hosting_facilities; it's
the first place to look for up-to-date, comprehensive information on
open source project hosting options.
Hosting on Fully Open Source Infrastructure
Although all the canned hosting sites use plenty of free
software in their stack, most of them also wrote some proprietary
code to glue it all together. In these cases the hosting environment
itself is not fully open source, and thus cannot be easily reproduced
by others. For example, while Git itself is free software, GitHub is
a hosted service running partly with proprietary
software — if you leave GitHub, you can't take a copy
of their infrastructure with you, at least not all of it.
Some projects would prefer a canned hosting site that runs an
entirely free software infrastructure. This might be to preserve and
signal their commitment to software freedom, and in some cases might
also be due to immediate utilitarian
considerations — for example, politically sensitive
projects that are worried about being deplatformed want to know that
they can reproduce their project's hosting independently should it
ever become necessary.
Fortunately, there are places to obtain fully free-software
commercial hosting. I will list a few examples below (as of early
2020), albeit with no pretense of completeness.
GitLab (https://gitlab.com/)
GitLab offers an excellent collaboration platform that
comes in two versions: fully free-software (they call this
their "Community Edition") and proprietary (which they call
their "Enterprise Edition"; this terminology deserves scare quotes).
The proprietary edition is hosted by GitLab.com, and has a few
features the open source edition doesn't have. Interestingly,
GitLab.com themselves don't offer hosting of the strictly open
source edition, but some other companies do. Two of them are
GitLabHost BV (https://www.gitlabhost.com/) and 2nd Watch (https://www.2ndwatch.com/); you can
probably find others by searching https://partners.gitlab.com/. (It's also pretty easy to set up
your own instance of GitLab. My own company did so at https://code.librehq.com/ and it was fairly simple,
although we have to perform security upgrades frequently. This
does not mean that GitLab is disproportionately likely to have
security problems; it just means that GitLab is very popular
and therefore a lot of people are available to detect and
report problems.)
Sourcehut (https://sourcehut.org/ and https://sr.ht/)
Sourcehut offers project hosting with both Git and
Mercurial available as version control systems. It is designed
to be light, fast, and developer-focused: there is no tracking
nor advertising, all of its features work without in-browser
Javascript, and many of its features work without even
requiring a user account (e.g., some email-driven interactions
with the bug tracker). As of late 2023, it's officially still
in "public alpha", but it is stable and is fine for projects
that need reliable hosting.
Codeberg (https://codeberg.org/)
Codeberg offers zero-cost project hosting for free and
open source projects. It's run by a non-profit organization in
Germany that supports free (libre) culture; the service is featureful
and under active development as of late 2023. Codeberg's
underlying platform is Forgejo (https://codeberg.org/forgejo/forgejo), which is itself a
community fork made in reaction to an unexpected corporate move
in another free software project (see https://forgejo.org/2022-12-15-hello-forgejo for
details).
Should you host your project on fully open source
infrastructure? I can't answer that question for you, since it
ultimately depends on you and your project's philosophical positions.
However, as a practical matter, I cannot say I've seen any evidence
that the degree of software-freedom of the hosting platform has much
effect on a project's success. The vast majority of developers who
work on free software projects seem to be willing to participate
through a non-free hosting platform when that's what the project is
using.
Whether the hosting platform is itself free software or not, it
is crucial to be able to interact with project data in automatable
ways, and to have a way to export data out of the hosting platform. A
site that meets these criteria can never truly lock you in, and will
even be somewhat extensible, via its programmatic interface.
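As one illustration of what "automatable" and "exportable" mean in practice, the sketch below pages through a project's tickets using GitHub's public REST API. The endpoint path, query parameters, and Accept header follow that API as I understand it; the function names, the User-Agent string, and the idea of stopping at the first empty page are choices made for this example. Other hosting sites have their own interfaces, so treat this as a pattern, not a recipe.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"


def issues_url(owner: str, repo: str, page: int = 1, per_page: int = 100) -> str:
    """Build the URL for one page of a repository's issues (open and closed)."""
    query = urlencode({"state": "all", "per_page": per_page, "page": page})
    return f"{API_ROOT}/repos/{quote(owner, safe='')}/{quote(repo, safe='')}/issues?{query}"


def fetch_issues_page(owner: str, repo: str, page: int = 1) -> list:
    """Fetch one page of issues over the network and decode the JSON reply."""
    request = urllib.request.Request(
        issues_url(owner, repo, page),
        headers={"Accept": "application/vnd.github+json",
                 "User-Agent": "export-sketch"},  # arbitrary identifying string
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def export_all_issues(owner: str, repo: str, fetch_page=fetch_issues_page) -> list:
    """Collect every page until an empty one comes back."""
    issues = []
    page = 1
    while True:
        batch = fetch_page(owner, repo, page)
        if not batch:
            return issues
        issues.extend(batch)
        page += 1
```

Note that on GitHub this endpoint returns pull requests along with ordinary issues, and that unauthenticated requests are rate-limited, so a real export script would need authentication and some filtering. The `fetch_page` parameter exists so that the paging logic can be exercised without touching the network.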
Of course, all the above applies only to the servers of the
hosting site. Your project itself should never require participants
to run proprietary software on their own
machines. (The exception to this is proprietary
Javascript code that is received from the hosting site and run
confined or "sandboxed" in one tab in the user's browser. The
question of whether such code is conceptually an extension of the
server, or should be thought of as running on the client machine even
though in some senses it has more access to server resources than it does to
client resources, is a deep and ongoing debate. We won't settle it
here, but the issue is at least more complex than just which CPU is executing
the instructions.)
Anonymity and Involvement
A problem that is not strictly limited to the canned sites, but
is most often found there, is the over-requirement of user
registration to participate in various aspects of the project. The
proper degree of requirement is a bit of a judgement call. User
registration helps prevent spam, for one thing, and even if every
commit gets reviewed you still probably don't want
anonymous strangers
pushing changes into your repository, for example. (Pseudonymous is another
matter. As long as a consistent identity has accrued reputation, you
may not need to know who it actually is.)
But sometimes user registration ends up being required for tasks
that ought to be permitted to unregistered visitors, especially the
ability to file tickets in the bug tracker, and to comment on existing
tickets. By requiring a logged-in username for such actions, the
project raises the involvement bar for what should be quick,
convenient tasks. It also changes the demographics of who files bugs,
since those who take the trouble to set up a user account at the
project site are hardly a random sample even from among users who are
willing to file bugs (who in turn are already a biased subset of all
the project's users). Of course, one wants to be able to contact
someone who's entered data into the ticket tracker, but having a field
where she can enter her email address (if she wants to) would be sufficient for that.
If a new user spots a bug and wants to report it, she'll only be
annoyed at having to fill out an account creation form before she can
enter the bug into the tracker. She may simply decide not to file the
bug at all.
If you have control over which actions can be done anonymously,
make sure that at least all read-only actions are
permitted to non-logged-in visitors, and if possible that data entry
portals, such as the bug tracker, that tend to bring information from
users to developers, can also be used anonymously, although of course
anti-spam techniques, such as captchas, may still be necessary.
Message Forums / Mailing Lists
Not all projects need to use discussion forum software. For
relatively small, focused projects that are organized around a single
code repository, the email gateway features of the bug tracker (as
discussed later in this chapter) may
be enough to sustain most conversations. When a non-technical topic
needs to be discussed, someone can just create an issue
ticket — a fake bug report,
essentially — for the topic and conduct the discussion
there. So if you think your project will get along fine without
forums, you can skip this section and just try that. It will be
obvious pretty quickly if you do need them.
Larger and more complex projects, however, will almost always
benefit from having dedicated discussion forums. This is partly
because there will be many conversations that are not attached to a
specific bug, and partly because the larger the project, the more
important it is to keep the bug tracker focused on actual bugs and
have a separate place for other kinds of discussions.
For a long time, discussion forums were mainly mailing lists,
but the distinction between mailing lists and Web-based forums is,
thankfully, slowly disappearing. Services like Google Groups (https://groups.google.com/), which
is not itself open source, and Discourse (http://www.discourse.org/), which is, have established that
cross-accessibility of message forums as mailing lists and vice versa
is the minimum bar to meet, and modern discussion management systems
reflect this.
Because of this nearly-completed unification between email lists
and web-based forums (which was a long time
coming; see http://www.rants.org/2008/03/06/thread_theory/ for more, and no, I'm not
too dignified to refer to my own blog post), I will
use the terms message forum and
mailing list more or less interchangeably.
They refer to any kind of message-based forum where posts are linked
together in threads (topics), people can subscribe, archives of past
messages can be browsed, and the forum can be interacted with via
email or via a web browser.
If a user is exposed to any channel besides a project's web
pages, it is most likely to be one of the project's message forums.
But before she experiences the forum itself, she will experience the
process of finding the right forum. Your project should
have a prominently-placed description of all the available public
forums, to give newcomers guidance in deciding which ones to browse or
post to first. A typical such description might say something like
this:
The mailing lists are the main day-to-day communication channels for
the Scanley community. You don't have to be subscribed to post to a
list, but if it's your first time posting (whether you're subscribed
or not), your message may be held in a moderation queue until a
human moderator has a chance to confirm that the message is not spam.
We're sorry for this delay; blame the spammers who make it necessary.
Scanley has the following lists:
users {_AT_} scanley.org:
Discussion about using Scanley or programming with the Scanley
API, suggestions of possible improvements, etc. You can browse the
users@ archives at
<<<link to archive>>>
or subscribe here:
<<<link to subscribe>>>.
dev {_AT_} scanley.org:
Discussion about developing Scanley. Maintainers and contributors
are subscribed to this list. You can browse the
dev@ archives at
<<<link to archive>>>
or subscribe here:
<<<link to subscribe>>>.
(Sometimes threads cross over between users@
and dev@, and
Scanley's developers will often participate in discussions on both
lists. In general if you're unsure where a question or post
should go, start it out on users@. If it should be a
development discussion, someone will suggest moving it over to
dev@.)
announcements {_AT_} scanley.org:
This is a low-traffic, subscribe-only list. The Scanley
developers post announcements of new releases and occasional other
news items of interest to the entire Scanley community here, but
followup discussion takes place on users@ or
dev@.
<<<link to subscribe>>>.
notifications {_AT_} scanley.org:
All code commit messages, bug tracker tickets, automated
build/integration failures, etc, are sent to this list. Most
developers should subscribe:
<<<link to subscribe>>>.
There is also a non-public list you may need to send to, although
only developers are subscribed:
security {_AT_} scanley.org:
Where the Scanley project receives confidential reports of
security vulnerabilities. Of course, the report will be made
public eventually, but only after a fix is released; see our
security procedures page for more [...]
Choosing the Right Forum Management Software
It's worth investing some time in choosing the right mailing
list management system for your project. Modern list management tools
(some of which are listed later) offer at least the following
features:
Both email- and web-based access
Users should be able to subscribe to the forums by email,
and read them on the web (where they are organized into
conversations or "threads", just as they would be in a
mailreader).
Moderation features
To "moderate" is to check posts, especially first-time
posts, to make sure they are not spam before they go out
to the entire list. Moderation necessarily involves
human administrators, but software can do a great deal to
make it easier on the moderators. There is more said
about moderation later in this chapter.
Rich administrative interface
There are many things administrators need to do besides
spam moderation — for example, removing
obsolete addresses, a task that can become urgent when a
recipient's address starts sending "I am no longer at this
address" bounces back to the list in response to every
list post (though some systems can even detect this and
unsubscribe the person automatically). If your forum
software doesn't have decent administrative capabilities,
you will quickly realize it, and should consider switching
to software that does.
Header manipulation
Some people have sophisticated filtering and replying
rules set up in their mail readers, and rely on the forum
adding or manipulating certain standard headers. There is
more on this later in this chapter.
Archiving
All posts to the managed lists are stored and made
available on the web (public archives are
important). Usually the archiver is a
native part of the message forum system; occasionally, it
is a separate tool that needs to be integrated.
The point of the above list is really just to show that forum
management is a complex problem that has already been given a lot of
thought, and to some degree been solved. You don't need to become an
expert, but you will have to learn at least a little bit about
it, and you should expect list management to occupy your attention
from time to time in the course of running any free software project.
Below we'll examine a few of the most common issues.
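As a small illustration of the header manipulation feature listed above, here is a sketch of a list server stamping an outgoing post with list-identifying headers that mail readers can filter on. List-Id, List-Post, and List-Unsubscribe are standard header names; the function name and the `-unsubscribe` address convention are assumptions made for this example, and real list software decides these details for you.

```python
from email.message import EmailMessage


def add_list_headers(msg: EmailMessage, list_name: str, domain: str) -> EmailMessage:
    """Stamp a post with headers identifying the list it was sent through."""
    headers = {
        "List-Id": f"<{list_name}.{domain}>",
        "List-Post": f"<mailto:{list_name}@{domain}>",
        # Hypothetical unsubscribe address; conventions vary between systems.
        "List-Unsubscribe": f"<mailto:{list_name}-unsubscribe@{domain}>",
    }
    for name, value in headers.items():
        del msg[name]      # drop any value the sender supplied (no error if absent)
        msg[name] = value
    return msg
```

A subscriber can then file everything carrying `List-Id: <dev.scanley.org>` into one folder with a single filter rule, regardless of who wrote each message or what its subject line says.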
Spam Prevention
A mailing list that takes no spam prevention measures at all
will quickly be submerged in junk emails, to the point of unusability.
Spam prevention is mandatory. It is really two distinct functions:
preventing spam posts from appearing on your mailing lists, and
preventing your mailing list from being a source of new email
addresses for spammers' harvesters.
Filtering posts
There are three basic techniques for preventing spam posts, and
most mailing list software offers all three. They are best used in
tandem:
Only auto-allow postings from list subscribers.
This is effective as far as it goes, and also
involves very little administrative overhead, since it's
usually just a matter of changing a setting in the mailing
list software's configuration. But note that posts which
aren't automatically approved must not be simply
discarded. Instead, they should go into a moderation
queue, for two reasons. First, you want to allow
non-subscribers to post: a person with a question or
suggestion should not need to subscribe to a mailing list
just to ask a question there. Second, even
subscribers may sometimes post from an address other than
the one by which they're subscribed. Email addresses are
not a reliable method of identifying people, and shouldn't
be treated as such.
Filter posts through spam-detection software.
If the mailing list software makes it possible (most
do), you can have posts filtered by spam-filtering
software. Automatic spam-filtering is not perfect, and
never will be, since there is a never-ending arms race
between spammers and filter writers. However, it can
greatly reduce the amount of spam that makes it through to the
moderation queue. Since the longer that queue is the
more time humans must spend examining it, any amount of
automated filtering is beneficial.
There is not space here for detailed instructions
on setting up spam filters. You will have to consult
your mailing list software's documentation for that (see
). List
software often comes with some built-in spam prevention
features, but you may want to add some third-party
filters. I've had good experiences with SpamAssassin
(https://spamassassin.apache.org/). That
is not a comment on the many other open source spam
filters out there, some of which are apparently also quite
good; I just happen to have used SpamAssassin myself and
been satisfied with it.
Moderation.
For mails that aren't automatically allowed by
virtue of being from a list subscriber, and which make it
through the spam filtering software, if any, the last stage
is moderation: the mail is routed
to a special holding area, where a human examines it and
confirms or rejects it.
Confirming a post usually takes one of two forms:
you can accept the sender's post just this once, or you
can tell the system to allow this and all future posts
from the same sender. You almost always want to do the
latter, in order to reduce the future moderation
burden — after all, someone who has made a
valid post to a forum is unlikely to suddenly turn into a
spammer later.
Rejecting is done by either marking the item to be
discarded, or by explicitly telling the system the message
was spam so the system can improve its ability to
recognize future spams. Sometimes
you also have the option to automatically discard future
mails from the same sender without them ever being held in
the moderation queue, but there is rarely any point doing
this, since spammers don't send from the same address
twice anyway.
Oddly, most message-forum systems have not yet given
the moderation queue administrative interface the
attention it deserves, considering how common the task is,
so moderation often still requires more clicks and UI
gestures than it should. I hope this situation will
improve in the future. In the meantime, perhaps knowing
you're not alone in your frustration will temper your
disappointment somewhat.
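The three techniques above can be combined into a single routing decision. The following Python sketch is only an illustration under stated assumptions: the names `route_post`, `approve`, `approved_senders`, and `threshold` are hypothetical and do not come from any particular mailing list software, and the spam score is assumed to be supplied by whatever spam filter you have plugged in.

```python
def route_post(sender, spam_score, subscribers, approved_senders, threshold=5.0):
    """Decide what to do with an incoming post.

    Returns "deliver", "discard", or "hold" (hold = moderation queue).
    Addresses in `subscribers` and `approved_senders` are assumed to be
    stored in lowercase. The threshold value is an arbitrary assumption.
    """
    sender = sender.strip().lower()
    # 1. Auto-allow subscribers and senders a moderator approved earlier.
    if sender in subscribers or sender in approved_senders:
        return "deliver"
    # 2. Drop only what the spam filter is confident about.
    if spam_score >= threshold:
        return "discard"
    # 3. Everything else waits for a human; it is never silently dropped.
    return "hold"


def approve(sender, approved_senders, remember=True):
    """A moderator accepts a held post, optionally allowing future posts too."""
    if remember:
        approved_senders.add(sender.strip().lower())
    return "deliver"
```

Note that a non-subscriber whose post looks legitimate ends up in the moderation queue rather than being discarded, and that approving with `remember=True` corresponds to the "allow this and all future posts from the same sender" choice described above.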
Use the Moderation Channel Only for Moderation
Be sure to use moderation only for
filtering out spams, and perhaps for clearly off-topic messages such
as when someone accidentally posts to the wrong mailing list.
Although the moderation system may give you a way to respond directly
to the sender, you should never use that method to answer questions
that really belong on the mailing list itself, even if you know the
answer off the top of your head. To do so would deprive the project's
community of an accurate picture of what sorts of questions people are
asking, and deprive people of a chance to answer questions themselves
and/or see answers from others. (This is really just a special case
of the advice in .)
Mailing list moderation is strictly about keeping the list free of
spam and of wildly off-topic or otherwise inappropriate emails,
nothing more.
The Great Reply-to Debate
Earlier, I stressed the
importance of making sure discussions stay in public forums, and
talked about how active measures are sometimes needed to prevent
conversations from trailing off into private email threads;
furthermore, this chapter is all about setting up project
communications software to do as much of the work for people as possible.
Therefore, if the mailing list management software offers a way to
automatically cause discussions to stay on the list, you would think
turning on that feature would be the obvious choice.
Well, not quite. There is such a feature, but it has some
pretty severe disadvantages. The question of whether or not to use it
is one of the hottest debates in mailing list
management — admittedly, not a controversy that's likely to make
the evening news in your city, but it can flare up from time to time
in free software projects. Below, I will describe the feature, give
the major arguments on both sides, and make the best recommendation I
can.
The feature itself is very simple: the mailing list software
can, if you wish, automatically set the Reply-to header on every post
to redirect replies to the mailing list. That is, no matter what the
original sender puts in the Reply-to header (or even if they don't
include one at all), by the time the list subscribers see the post,
the header will contain the list address:
Reply-to: discuss@lists.example.org
On its face, this seems like a good thing. Because virtually
all mail reading software pays attention to the Reply-to header, now
when anyone responds to a post, their response will be automatically
addressed to the entire list, not just to the sender of the message
being responded to. Of course, the responder can still manually
change where the message goes, but the important thing is that
by default replies are directed to the list.
It's a perfect example of using technology to encourage
collaboration.
Unfortunately, there are some disadvantages. The first is known
as the Can't Find My Way Back Home problem:
sometimes the original sender will put their "real" email address in
the Reply-to field, because for one reason or another they send email
from a different address than where they receive it. People who
always read and send from the same location don't have this problem,
and may be surprised that it even exists. But for those who have
unusual email configurations, or who cannot control how the From
address on their mails looks (perhaps because they send from work and
do not have any influence over the IT department), using Reply-to may
be the only way they have to ensure that responses reach them. When
such a person posts to a mailing list that she's not subscribed to, her
setting of Reply-to becomes essential information. If the list
software overwrites it, she may never see the responses to her post.
(In theory, the list software could add the list's address to whatever
Reply-to destination were already present, if any, instead of
overwriting. In practice, for reasons I don't know, most list
software overwrites instead of appending.)
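To make the difference between overwriting and appending concrete, here is a small sketch using Python's standard `email` package. It is not taken from any list manager: the function name `munge_reply_to`, the `append` flag, and the example addresses are all assumptions made for illustration.

```python
from email.message import EmailMessage


def munge_reply_to(msg, list_address, append=False):
    """Point Reply-To at the list, either overwriting or appending.

    `append` models the alternative of keeping the sender's own Reply-To;
    it is a hypothetical option, not a setting of any specific software.
    """
    original = msg["Reply-To"]  # None if the sender set no Reply-To
    del msg["Reply-To"]         # no error if the header is absent
    if append and original:
        msg["Reply-To"] = f"{original}, {list_address}"
    else:
        msg["Reply-To"] = list_address
    return msg


def example_post():
    """A post from someone who sends from work but reads mail at home."""
    msg = EmailMessage()
    msg["From"] = "j.random@work.example.com"
    msg["Reply-To"] = "jrandom@home.example.net"
    msg["Subject"] = "Question about the build"
    msg.set_content("How do I build this on my system?")
    return msg
```

With `append=False`, the home address disappears from the message entirely, which is exactly the Can't Find My Way Back Home problem. With `append=True`, replies would reach both the list and the poster.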
The second disadvantage has to do with expectations, and in my
opinion is the most powerful argument against Reply-to munging. Most
experienced mail users are accustomed to two basic methods of
replying: reply-to-all and
reply-to-author. All modern mail reading
software has separate keys for these two actions. Users know that to
reply to everyone (that is, including the list), they should choose
reply-to-all, and to reply privately to the author, they should choose
reply-to-author. Although you want to encourage people to reply to
the list whenever possible, there are certainly circumstances where a
private reply is the responder's prerogative — for example, they
may want to say something confidential to the author of the original
message, something that would be inappropriate on the public
list.
Now consider what happens when the list has overridden the
original sender's Reply-to. The responder hits the reply-to-author
key, expecting to send a private message back to the original author.
Because that's the expected behavior, he may not bother to look
carefully at the recipient address in the new message. He composes
his private, confidential message, one which perhaps says embarrassing
things about someone on the list, and hits the send key.
Unexpectedly, a few minutes later his message appears on the
mailing list! True, in theory he should have looked
carefully at the recipient field, and should not have assumed anything
about the Reply-to header. But authors almost always set Reply-to to
their own personal address (or rather, their mail software sets it for
them), and many longtime email users have come to expect that. In
fact, when a person deliberately sets Reply-to to some other address,
such as the list, she usually makes a point of mentioning this in the
body of her message, so people won't be surprised at what happens when
they reply.
Because of the possibly severe consequences of this unexpected
behavior, my own preference is to configure list management software
to never touch the Reply-to header. This is one instance where using
technology to encourage collaboration has, it seems to me, potentially
dangerous side-effects. However, there are also some powerful
arguments on the other side of this debate. Whichever way you choose,
you will occasionally get people posting to your list asking why you
didn't choose the other way. Since this is not something you ever
want as the main topic of discussion on your list, it might be good to
have a canned response ready, of the sort that's more likely to stop
discussion than encourage it. Make sure you do
not insist that your decision, whichever it is,
is obviously the only right and sensible one (even if you think that's
the case). Instead, point out that this is a very old debate, there
are good arguments on both sides, no choice is going to satisfy
all users, and therefore you just made the best decision you
could. Politely ask that the subject not be revisited unless someone
has something genuinely new to say, then stay out of the thread and
hope it dies a natural death. (See also .)
Someone may suggest a vote to choose one way or the other. You
can do that if you want, but I personally do not feel that counting
heads is a satisfactory solution in this case. The penalty for
someone who is surprised by the behavior is so huge (accidentally
sending a private mail to a public list), and the inconvenience for
everyone else is fairly slight (occasionally having to remind someone
to respond to the whole list instead of just to you), that it's not
clear that a majority should be able to put a minority at such
risk.
I have not addressed all aspects of this issue here, just the
ones that seemed most important. For a full discussion, see
these two canonical documents, which are the ones people always cite
when they're having this debate:
Leave Reply-to alone,
by Chip Rosenthal
https://unicom.crosenthal.com/pw/reply-to-harmful.html
Set Reply-to to list,
by Simon Hill
https://web.archive.org/web/20090223102606/http://www.metasystema.net/essays/reply-to.mhtml
Despite the mild preference indicated above, I do not feel there
is a "right" answer to this question, and happily participate in many
lists that do set Reply-to. (Although there is,
of course, a right answer, and it is to leave the original author's
Reply-to untouched. The relevant standards document, http://www.ietf.org/rfc/rfc2822.txt, says "When
the 'Reply-To:' field is present, it indicates the mailbox(es) to
which the author of the message suggests that replies be
sent.") The most important
thing you can do is settle on one way or the other early, and try not
to get entangled in debates about it after that. When the debate
re-arises every few years, as it inevitably will, you can point people
to the archived discussion from last time.
Two Fantasies
Someday, someone will get the bright idea to implement a
reply-to-list key in a mail reader. It would
use some of the custom list headers mentioned earlier to figure out
the address of the mailing list, and then address the reply directly
to the list only, leaving off any other recipient addresses, since
most are probably subscribed to the list anyway. Eventually, other
mail readers will pick up the feature, and this whole debate will go
away.
(Actually, the Mutt (http://www.mutt.org/) mail reader does offer this feature.
Then shortly after the first edition of this book appeared, Michael
Bernstein wrote me to say: "There are other email clients that
implement a reply-to-list function besides Mutt. For example,
Evolution has this function as a keyboard shortcut, but not a button
(Ctrl+L).")
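For the curious, a reply-to-list function needs very little logic. The sketch below assumes the list software adds the standard `List-Post` header (defined in RFC 2369) in its usual `<mailto:...>` form; the function names are hypothetical and this is not how Mutt or Evolution actually implement the feature.

```python
import re
from email.message import EmailMessage


def reply_to_list_address(msg):
    """Return the list's posting address from List-Post, or None."""
    header = str(msg.get("List-Post", ""))
    match = re.search(r"<mailto:([^>?]+)", header)
    return match.group(1) if match else None


def make_list_reply(original, body):
    """Build a reply addressed to the list only, dropping other recipients."""
    list_address = reply_to_list_address(original)
    if list_address is None:
        raise ValueError("message has no usable List-Post header")
    reply = EmailMessage()
    reply["To"] = list_address
    reply["Subject"] = "Re: " + str(original.get("Subject", ""))
    reply.set_content(body)
    return reply
```

Because the reply's destination comes from `List-Post` rather than from Reply-to, the original sender's Reply-to can be left untouched.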
An even better solution would be for Reply-to munging to be a
per-subscriber preference in the list management software. Those who
want Reply-to munged (either on posts they receive or on posts they
send) could ask for that, and those who don't could ask for Reply-to
to be left alone.
However, I don't know of any currently-maintained software that offers
this on a per-subscriber basis.
Archiving
Every discussion forum should be fully archived. It's common
for new discussions to refer to old ones, and often people doing an
Internet search will find a solution to a problem by stumbling across
a message that had been casually posted to a mailing list by some
stranger. Archives also provide history and context for new users and
developers who are becoming more involved in the project.
The technical details of setting up archiving are specific to
the software that's running the forum, and are beyond the scope of
this book. If you need to choose or configure an archiver, consider
these properties:
Prompt updating
People will often want to refer to an archived message
that was posted recently. If possible, the archiver
should archive each post instantaneously, so that by the
time a post appears on the mailing list, it's already
present in the archives. If that option isn't available,
then at least try to set the archiver to update itself
every hour or so. (By default, some archivers run their
update processes once per night, but in practice that's
far too much lag time for an active mailing list.)
Referential stability
Once a message is archived at a particular URL, it should
remain accessible at that exact same URL forever.
Even if the archives are
rebuilt, restored from backup, or otherwise fixed, any
URLs that have already been made publicly available
should remain the same. Stable references make it
possible for Internet search engines to index the
archives, which is a major boon to users looking for
answers. Stable references are also important because
mailing list posts and threads are often linked to from
other places, such as from the bug tracker (see
) or
from other project documents.
Ideally, mailing list software would include a message's
archive URL, or at least the message-specific portion of
the URL, in a header or footer when it distributes the message to
recipients. That way people who have a copy of the
message would be able to instantly know its archive location
without having to actually visit the archives, which would
be helpful because any operation that involves web
browsing is automatically time-consuming. Whether any
mailing list software actually offers this feature, I don't
know; unfortunately, the ones I have used do not.
However, it's something to look for (or, if you write
mailing list software, it's a feature to consider
implementing, please).
Thread support
It should be possible to go from any individual message to
the thread (group of related
messages) that the original message is part of. Each
thread should have its own URL too, separate from the URLs
of the individual messages in the thread.
Searchability
An archiver that doesn't support searching — on the
bodies of messages, as well as on authors and
subjects — is close to useless. Note that some archivers
support searching by simply farming the work out to an
external search engine such as Google. This is
acceptable, but direct search support is usually more
fine-tuned, because it allows the searcher to specify that
the match must appear in a subject line versus the body,
for example.
The above is just a technical checklist to help you evaluate and
set up an archiver. Getting people to
actually use the archiver to the project's
advantage is discussed in later chapters, in particular
.
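One way to get the referential stability described above is to derive each message's URL from something that never changes, such as its Message-ID header, rather than from the order in which messages happened to arrive. The following is a minimal sketch of that idea; the function name, the base URL, and the choice of SHA-1 with Base32 encoding are assumptions for illustration, not the behavior of any particular archiver.

```python
import base64
import hashlib


def archive_url(message_id, base_url="https://lists.example.org/archive/"):
    """Derive a permanent archive URL from a message's Message-ID header."""
    # Normalize: strip whitespace and the surrounding angle brackets.
    normalized = message_id.strip().strip("<>")
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()
    # Base32 avoids characters such as '/' and '+' in the URL component.
    token = base64.b32encode(digest).decode("ascii")
    return base_url + token + "/"
```

Since the URL depends only on the Message-ID, rebuilding the archive or restoring it from backup yields the same URL, and the list software could compute it at delivery time and include it in a header or footer (RFC 5064 defines an `Archived-At` header field for this purpose).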
Mailing List / Message Forum Software
Here are some tools for running message forums. If the site
where you're hosting your project already has a default setup, then
you can just use that and avoid having to choose. But if you need to
install one yourself, below are some possibilities. (Of course, there
are probably other tools out there that I just didn't happen to find,
so don't take this as a complete list.)
Discourse — https://discourse.org/
Discourse was built to be the One True Discussion System for
Web and mobile, and so far it seems to be living up to its
promise. It is open source, supports both browser-based and
email-based participation in discussions, and is under active
development with commercial support available. You can
purchase hosted Discourse if you don't want to set it up
yourself.
Sympa — https://www.sympa.org/
Sympa is developed and maintained by a consortium of French
universities. It is designed for a given instance to handle
both very large lists (> 1,000,000 members) and a large
number of lists. Sympa can work with a variety of
dependencies; for example, you can run it with sendmail,
postfix, qmail or exim as the underlying message transfer
agent. It has built-in Web-based archiving.
Mailman — http://www.list.org/
For many years, Mailman was the standard for open source
project mailing lists. It comes with a built-in archiver
and has hooks for plugging in external archivers.
Mailman is very reliable in terms of message delivery and
other under-the-hood functionality, but its reputation
suffered for a while because of various user interface issues
in its aging 2.x code base (especially for spam moderation
and subscription moderation), and delays in shipping its
long-awaited 3.0 release.
However, Mailman 3.0 has now shipped, and is worth a look.
It should solve many of the problems of Mailman 2, and may
make Mailman a reasonable choice again. This excellent
article by Sumana Harihareswara describes the major
improvements: https://lwn.net/Articles/638090/.
Google Groups — https://groups.google.com/
Listing Google Groups here was a tough call. The service is
not itself open source, and a few of its administrative
functions can be a bit hard to use. However, its advantages
are substantial: your group's archives are always online and
searchable; you don't have to worry about scalability,
backups, or other run-time infrastructure issues; the
moderation and spam-prevention features are pretty good (with
the latter constantly being improved, which is important in
the never-ending spam arms race); and Google Groups are easily
accessible via both email and web, in ways that are likely to
be already familiar to many participants. These are strong
advantages. If you just want to get your project started,
and don't want to spend too much time thinking about what
message forum software or service to use, Google Groups
is a good default choice.
Version Control
A version control system (or
revision control system) is a combination of
technologies and practices for tracking and controlling changes to a
project's files, in particular to source code, documentation, and web
pages. If you have never used version control before, the first thing
you should do is go find someone who has, and get them to join your
project. These days, everyone will expect at least your project's
source code to be under version control, and probably will not take
the project seriously if it doesn't use version control with at least
minimal competence.
The reason version control is so universal is that it helps with
virtually every aspect of running a project: inter-developer
communications, release management, bug management, code stability and
experimental development efforts, and attribution and authorization of
changes by particular developers. The version control system provides
a central coordinating force across all of these areas. The core of
version control is change management:
identifying each discrete change made to the project's files,
annotating each change with metadata like the change's date and
author, and then replaying these facts to whoever asks, in whatever
way they ask. It is a communications mechanism where a change is the
basic unit of information.
This section does not discuss all aspects of using a version
control system. It's so all-encompassing that it must be addressed
topically throughout the book. Here, we will concentrate on choosing
and setting up a version control system in a way that will foster
cooperative development down the road.
Version Control Vocabulary
This book cannot teach you how to use version control if you've
never used it before, but it would be impossible to discuss the
subject without a few key terms. These terms are useful independently
of any particular version control system: they are the basic nouns and
verbs of networked collaboration, and will be used generically
throughout the rest of this book. Even if there were no version
control systems in the world, the problem of change management would
remain, and these words give us a language for talking about that
problem concisely.
If you're comfortably experienced with version control already,
you can probably skip this section. If you're not sure, then read
through this section at least once. Certain version control terms
have gradually changed in meaning since the early 2000s, and you may
occasionally find people using them in incompatible ways in the same
conversation. Being able to detect that phenomenon early in a
discussion can often be helpful.
commit
To make a change to the project. More formally: to
store a change in the version control database in such a way that it
can be incorporated into future releases of the project. "Commit"
can be used as a verb or a noun. For example: "I just committed a
fix for the server crash bug people have been reporting on Mac OS X.
Jay, could you please review the commit and check that I'm not
misusing the allocator there?"
push
To publish a commit to a publicly online repository,
from which others can incorporate it into their copy of the
project's code. When one says one has pushed a commit, the
destination repository is usually implied. Usually it is the
project's authoritative repository, the one from which public
releases are made.
Note that in some older version control systems (e.g.,
Subversion), commits are automatically and unavoidably pushed up to
a predetermined central repository, while in most newer systems
(e.g., Git, Mercurial) the developer chooses when and where to push
commits. Because the former privilege a particular central
repository, they are known as "centralized" version control systems,
while the latter are known as "decentralized". In general,
decentralized systems are the modern trend, especially for open source
projects, which benefit from the peer-to-peer relationship between
developers' repositories. (Decentralized version control has actually
been around for a long time, but only relatively recently did it
become the most popular form of version control. It is now the
assumed default, especially for open source, in both senses: that
is, the version control systems are themselves open source, and are
intended to be suitable for managing open source software
projects.)
pull (or "update" or sometimes "fetch")
To pull others' changes (commits) into your copy of the
project. When pulling changes from a project's mainline
development branch (see ),
people often say "update" instead of "pull", for example: "Hey, I
noticed the indexing code is always dropping the last byte. Is this
a new bug?" "Yes, but it was fixed last week — try updating and
it should go away."
Note that in Git, "pull" and "fetch" are somewhat different.
To fetch
means to obtain the latest changes from a
remote repository (e.g., from the authoritative upstream repository)
and store them at the ready in your local repository, but
without merging them locally — in essence, it
means "synchronize my local copy of the remote repository with the
remote repository". To pull
means to fetch and then
automatically merge the received changes locally (setting conflict
markers if there are conflicts). Opinions differ on whether it is
better to fetch and then manually merge, or to just pull every time;
it depends both on your personal development style and on how the
project as a whole manages changes.
Despite this difference, even in Git-based projects developers
may colloquially say "fetch" to refer to obtaining changes, without
meaning fetch specifically as opposed to pull.
See also .
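The fetch/pull distinction can be modeled in a few lines. The class below is a deliberately tiny toy, written only to show that a pull is a fetch followed by a merge; the class name and its attributes are invented for this illustration and bear no resemblance to how Git actually stores history.

```python
class ToyRepo:
    """A toy model of fetch versus pull; commits are just strings."""

    def __init__(self, commits=()):
        self.branch = list(commits)           # your local line of development
        self.remote_tracking = list(commits)  # your copy of the remote's state

    def fetch(self, upstream):
        # Synchronize the local copy of the remote; the local branch is untouched.
        self.remote_tracking = list(upstream.branch)

    def merge(self):
        # Bring fetched commits into the local branch (conflicts are ignored here).
        for commit in self.remote_tracking:
            if commit not in self.branch:
                self.branch.append(commit)

    def pull(self, upstream):
        # A pull is a fetch followed immediately by a merge.
        self.fetch(upstream)
        self.merge()
```

After `fetch`, the new commits are available locally for inspection but the developer's own branch has not changed; after `pull`, it has.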
commit message or log message
A bit of commentary attached to each commit,
describing the nature and purpose of the commit (both terms are used
about equally often; I'll use them interchangeably in this book).
Log messages are among the most important documents in any project:
they are the bridge between the detailed, highly technical meaning
of each individual code change and the more user-visible world of
bugfixes, features and project progress. Later in this section,
we'll look at ways to distribute them to the appropriate audiences;
also,
discusses ways to encourage contributors to write concise and useful
commit messages.
repository
A database in which changes are stored and from which they are
published. In centralized version control systems, there is a
single, authoritative repository on a remote server; that repository
records all changes to the project, and each developer works with a
snapshot of the latest version on her own machine. In decentralized
systems, each developer has her own repository, changes can be
swapped back and forth between repositories arbitrarily, and the
question of which repository is authoritative (that is, the one from
which public releases are rolled) is defined purely by social
convention, instead of by a combination of social convention and
technical enforcement.
clone (see also "checkout")
To obtain one's own development repository by making
a copy of the project's central repository.
checkout
When used in discussion, "checkout" usually means
something like "clone", except that centralized systems don't really
clone the full repository, they just obtain a working copy. When
decentralized systems use the word "checkout", they also mean the
process of obtaining working files from a repository, but since the
repository is local in that case, the user experience is quite
different because the network is not involved.
In the centralized sense, a checkout produces a directory tree
called a "working copy" (see below), from which changes may be
sent back to the original repository.
working copy or working files
A developer's private directory tree containing the
project's source code files, and possibly its web pages or other
documents, in a form that allows the developer to edit them. A
working copy also contains some version control metadata saying what
repository it comes from, what branch it represents, and a few other
things. Typically, each developer has her own working copy, from
which she edits, tests, commits, pulls, pushes,
etc.
In decentralized systems, working copies and repositories are
usually colocated anyway, so the term "working copy" is less often
used. Developers instead tend to say "my clone" or "my copy" or
sometimes "my fork".
revision, change, changeset, or (again) commit
A "revision" is a precisely specified incarnation of
the project at a point in time, or of a particular file or directory
in the project at that time. These days, most systems also use "revision",
"change", "changeset", or "commit" to refer to a set of changes
committed together as one conceptual unit, even when multiple files are
involved, though colloquially most people would refer to changeset
12's effect on file F as "revision 12 of F".
These terms occasionally have distinct technical meanings in
different version control systems, but the general idea is always
the same: they give a way to speak precisely about exact points in
time in the history of a file or a set of files (say, immediately
before and after a bug is fixed). For example: "Oh yes, she fixed
that in revision 10" or "She fixed that in commit fa458b1fac".
When one talks about a file or collection of files without
specifying a particular revision, it is generally assumed that one
means the most recent revision(s) available.
"Version" Versus "Revision"
The word version is sometimes used as a
synonym for "revision", but I will not use it that way in this
book, because it is too easily confused with "version" in the sense
of a version of a piece of software — that is, the release or
edition number, as in "Version 1.0". However, since the phrase
"version control" is already standard, I will continue to use it as
a synonym for "revision control" and "change control". Sorry. One
of open source's most endearing characteristics is that it has two
words for everything, and one word for every two things.
diff
A textual representation of a change. A diff shows
which lines were changed and how, plus a few lines of surrounding
context on either side. A developer who is already familiar with
some code can usually read a diff against that code and understand
what the change did, and often even spot bugs.
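If you have never seen a diff, Python's standard `difflib` module can produce one in the common "unified" format. The file name and the two versions of the function below are made up for the example.

```python
import difflib

# Two revisions of the same (hypothetical) file, as lists of lines.
old = ["def area(w, h):\n", "    return w + h\n"]
new = ["def area(w, h):\n", "    return w * h\n"]

diff_lines = list(
    difflib.unified_diff(
        old,
        new,
        fromfile="a/geometry.py",
        tofile="b/geometry.py",
        n=3,  # lines of surrounding context to show
    )
)
print("".join(diff_lines))
```

In the output, lines beginning with `-` were removed, lines beginning with `+` were added, and lines beginning with a space are unchanged context.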
tag or snapshot
A label for a particular state of the project at a
point in time. Tags are generally used to mark interesting
snapshots of the project. For example, a tag is usually made for
each public release, so that one can obtain, directly from the
version control system, the exact set of files/revisions comprising
that release. Tag names are often things like
Release_2_0, Delivery_20211009,
etc.
branch
A copy of the project, under version control but
isolated so that changes made to the branch don't affect other
branches of the project, and vice versa, except when changes are
deliberately "merged" from one branch to another (see below).
Branches are also known as "lines of development". Even when a
project has no explicit branches, development is still considered to
be happening on the "main branch", also known
as the "main line" or
"trunk" or sometimes
"master".
Branches are a way to keep different lines of development
from interfering with each other. For example, a short-term branch
is typically used for a bugfix or a minor enhancement. Longer-term
branches can also be used for experimental development that would be
too destabilizing for the main line.
Conversely, a branch can also be used as a safely isolated
place in which to stabilize a new release. During the release
process, regular development — that is, frequent integration of
development branches — would continue uninterrupted in the main
branch; meanwhile, on the release branch, no changes are allowed
except those approved by the release managers. This way, making a
release needn't interfere with ongoing development work. See for a more detailed discussion of
branching.
merge or port
To move a change from one branch to another. This
includes merging from the main branch to some other branch, or vice
versa. In fact, those are the most common kinds of merges; it is
less common to port a change between two non-main branches. See
for more on change porting.
"Merge" has a second, related meaning: it is what some version
control systems do when they see that two people have changed the
same file but in non-overlapping ways. Since the two changes do not
interfere with each other, when one of the people updates their copy
of the file (already containing their own uncommitted changes), the other
person's changes will be automatically merged in. This is very
common, especially on projects where multiple people are hacking on
the same code. When two different changes do
overlap, the result is a "conflict"; see below.
conflict
What happens when two people try to make different
changes to the same place in the code. All version control systems
automatically detect conflicts, and notify at least one of the
humans involved that their changes conflict with someone else's. It
is then up to that human to resolve the
conflict, and to communicate that resolution to the version control
system.
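The difference between an automatic merge and a conflict can be sketched in a few lines. This is a deliberately simplified model, not how any particular version control system implements merging: it assumes both people only modified existing lines in place (no insertions or deletions), and the function name merge_lines is made up for this example.

```python
def merge_lines(base, mine, theirs):
    """Three-way merge of line lists that were only edited in place.

    Returns (merged_lines, conflict_indexes).
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for i, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t:        # both sides agree (or neither changed the line)
            merged.append(m)
        elif m == b:      # only the other person changed this line
            merged.append(t)
        elif t == b:      # only I changed this line
            merged.append(m)
        else:             # both changed the same line differently
            conflicts.append(i)
            merged.append(f"<<<<<<< mine\n{m}\n=======\n{t}\n>>>>>>> theirs")
    return merged, conflicts
```

When the conflict list comes back empty, the changes did not overlap and the merge is safe to apply; otherwise a human has to decide what each conflicting line should say.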
revert or reversion
To undo an already-committed change to the software.
The undoing itself is a versioned event, and is usually done by
asking the version control system to reverse the change(s) in
question, rather than by manually making the edits and committing
them.
lock
A way to declare an exclusive intent to change a
particular file or directory. For example, "I can't commit any
changes to the web pages right now. It seems Alfred has them all
locked while he fixes their background images." Not all version
control systems even offer the ability to lock, and of those that
do, not all require the locking feature to be used. This is because
parallel, simultaneous development is the norm, and locking people
out of files is (usually) contrary to this ideal.
Version control systems that require locking to make commits
are said to use the lock-modify-unlock model.
Those that do not are said to use the
copy-modify-merge model. An excellent
in-depth explanation and comparison of the two models may be found
at https://svnbook.red-bean.com/nightly/en/svn.basic.version-control-basics.html#svn.basic.vsn-models. In
general, the copy-modify-merge model is better for open source
development, and all the version control systems discussed in this
book support that model.
Choosing a Version Control System
If you don't already have an opinion about which version control
system your project should use, then choose Git (https://git-scm.com/), and host your
project's repositories at GitHub (https://github.com/), which offers unlimited free hosting for open
source projects.
Git is by now the de facto
standard in the open source world, as is hosting one's repositories at
GitHub. Because so many developers are already comfortable with that
combination, choosing it sends the signal that your project is ready
for participants. But Git-at-GitHub is not the only viable
combination. Many projects host their authoritative Git repository
somewhere else, either at another public hosting site (see ) or on their own server (perhaps using one
of the open source forge systems listed in ). Some projects use a different
version control system entirely, such as Mercurial (https://www.mercurial-scm.org/).
There isn't space here for an in-depth exploration of why you
might choose something other than Git. If you have a reason to do so,
then you already know what that reason is. If you don't, then just
use Git (on either GitHub or GitLab). If you find yourself using
something other than Git or Mercurial, ask yourself
why — because whatever that other version control
system is, most other developers won't be familiar with it, and it
likely has a smaller community of support around it than those two
do.
Using the Version Control System
The recommendations in this section are not targeted toward a
particular version control system, and should be implementable in any
of them. Consult your specific system's documentation for
details.
Version Everything
Keep not only your project's source code under version control,
but also its web pages, documentation, FAQ, design notes, and anything
else that people might want to edit. Keep them right with the
source code, in the same repository tree. Any piece of information
worth writing down is worth versioning — that is, any piece of
information that could change. Things that don't change should be
archived, not versioned. For example, an email, once posted, does not
change; therefore, versioning it wouldn't make sense (unless it becomes
part of some larger, evolving document).
The reason to version everything together in one place is so
that people only have to learn one mechanism for submitting changes.
Often a contributor will start out making edits to the web pages or
documentation, and move to small code contributions later, for
example. When the project uses the same system for all kinds of
submissions, people only have to learn the ropes once. Versioning
everything together also means that new features can be committed
together with their documentation updates, that branching the code
will branch the documentation too, etc.
Don't keep generated files under version
control. They are not truly editable data, since they are produced
programmatically from other files. For example, some build systems
create a file named configure based on a template
in configure.in. To make a change to
configure, one would edit
configure.in and then regenerate; thus, only the
template configure.in is an "editable file."
Just version the templates — if you version the generated files as
well, people will inevitably forget to regenerate them when they commit a
change to a template, and the resulting inconsistencies will cause
endless confusion.
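One way to keep generated files out of the repository is to check the list of versioned paths against a set of known "generated" patterns. The sketch below is only an illustration: the pattern list is hypothetical, and most version control systems have a built-in ignore mechanism that is the better place to record such patterns.

```python
from fnmatch import fnmatch

# Hypothetical patterns for files that are produced by the build.
GENERATED_PATTERNS = ["configure", "*.o", "build/*"]


def generated_files(tracked_paths, patterns=GENERATED_PATTERNS):
    """Return the tracked paths that look like generated files."""
    return [
        path for path in tracked_paths
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]
```

Note that the template configure.in does not match the pattern for configure, so only the generated file would be reported.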
There are technical exceptions to the rule that all editable
data should be kept in the same version control system as the code.
For example, a project's bug tracker and its wiki hold plenty of
editable data, but usually do not store that data in the main version
control system. (Some development environments have tried
to integrate everything into one unified, version-controlled world, e.g.,
https://fossil-scm.org/ and
http://veracity-scm.com/,
but so far none of them have gained widespread adoption in the open
source world.) However, trackers and wikis should still have
versioning systems of their own, e.g., the comment history in a bug
ticket, and the ability to browse past revisions and view differences
between them in a wiki.
Browsability
The project's repository should be browsable on the Web. This
means not only the ability to see the latest revisions of the
project's files, but to go back in time and look at earlier revisions,
view the differences between revisions, read log messages for selected
changes, etc.
Browsability is important because it is a lightweight portal to
project data. If the repository cannot be viewed through a web
browser, then someone wanting to inspect a particular file (say, to
see if a certain bugfix had made it into the code) would first have to
install version control client software locally, which could turn
their simple query from a two-minute task into a half-hour or longer
task.
Browsability also implies canonical URLs for viewing a
particular change (i.e., a commit), and for viewing the latest
revision at any given time without specifying its commit identifier.
This can be very useful in technical discussions or when pointing
people to documentation or examples. If you tell someone a URL that
always points to the latest revision of a file, or to a particular
known revision, the communication is completely unambiguous, and
avoids the issue of whether the recipient has an up-to-date working
copy of the code themselves.
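The three kinds of canonical URL can be illustrated with a small helper. The path layout used here (/blob/REF/PATH and /commit/ID) is modeled on what GitHub-style hosting sites commonly use, and the base URL is a placeholder; treat both as assumptions and check your own repository browser's scheme.

```python
def latest_url(repo_base, branch, path):
    """URL that always shows the newest revision of a file on a branch."""
    return f"{repo_base}/blob/{branch}/{path}"


def permalink_url(repo_base, commit_id, path):
    """URL that shows a file exactly as it was at one known revision."""
    return f"{repo_base}/blob/{commit_id}/{path}"


def commit_url(repo_base, commit_id):
    """URL for viewing one particular change."""
    return f"{repo_base}/commit/{commit_id}"


if __name__ == "__main__":
    base = "https://example.org/proj/repo"  # hypothetical repository
    print(latest_url(base, "main", "docs/faq.txt"))
    print(permalink_url(base, "c39fcac089", "docs/faq.txt"))
    print(commit_url(base, "c39fcac089"))
```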
Some version control systems come with built-in
repository-browsing mechanisms, and in any case all hosting sites
offer it via their web interfaces. But if you need to install a
third-party tool to get repository browsing, do so; it's worth
it.
Use Branches to Avoid Bottlenecks
Non-expert version control users are sometimes a bit afraid of
branching and merging. If you are among those people, resolve right
now to conquer any fears you may have and take the time to learn how
to do branching and merging. They are not difficult operations, once
you get used to them, and they become increasingly important as a
project acquires more developers.
Branches are valuable because they turn a scarce
resource — working room in the project's code — into an
abundant one. Normally, all developers work together in the same
sandbox, constructing the same castle. When someone wants to add a
new drawbridge, but can't convince everyone else that it would be an
improvement, branching makes it possible for her to copy the
castle, take it off to an isolated corner, and try out the new
drawbridge design. If the effort succeeds, she can invite the
other developers to examine the result (in GitHub-speak, this
invitation is known as a "pull request" — see ). If everyone agrees that the
result is good, she or someone else can tell the version control
system to move ("merge") the drawbridge from the branch version of the
castle over to the main version, usually called the
main branch.
It's easy to see how this ability helps collaborative
development. People need the freedom to try new things without
feeling like they're interfering with others' work. Equally
importantly, there are times when code needs to be isolated from the
usual development churn, in order to get a bug fixed or a release
stabilized (see and
) without worrying
about tracking a moving target. At the same time, people need to be
able to review and comment on experimental work, whether it's
happening in the main branch or somewhere else. Treating branches
as first-class, publishable objects makes all this possible.
Use branches liberally, and encourage others to use them. But
also make sure that a given branch is only active for as long as
needed. Every active branch is a slight drain on the community's
attention. Even those who are not working in a branch still stumble
across it occasionally; it enters their peripheral awareness from time
to time and draws some attention. Sometimes such awareness is
desirable, of course, and commit notices should be sent out for branch
commits just as for any other commit. But branches should not become
a mechanism for dividing the development community's efforts. With
rare exceptions, the eventual goal of most branches should be to merge
their changes back into the main line and disappear, as soon as
possible.
Singularity of Information
Merging has an important corollary: never commit the same change
twice. That is, a given change should enter the version control
system exactly once. The revision (or set of revisions) in which the
change entered is its unique identifier from then on. If it needs to
be applied to branches other than the one on which it entered, then it
should be merged from its original entry point to those other
destinations — as opposed to committing a textually identical
change, which would have the same effect in the code, but would make
accurate bookkeeping and release management much harder.
The practical effects of this advice differ from one version
control system to another. In some systems, merges are special
events, fundamentally distinct from commits, and carry their own
metadata with them. In others, the results of merges are committed
the same way other changes are committed, so the primary means of
distinguishing a "merge commit" from a "new change commit" is in the
log message. In a merge's log message, don't repeat the log message
of the original change. Instead, just indicate that this is a merge,
and give the identifying revision of the original change, with at most
a one-sentence summary of its effect. If someone wants to see the
full log message, she should consult the original revision.
Non-duplication makes it easier to be sure when one has tracked down
the original source of a change: when you're looking at a complete log
message that doesn't refer to some other merge source, you can know
that it must be the original change, and treat it accordingly.
The same principle applies to reverting a change. If a change
is withdrawn from the code, then the log message for the reversion
should merely state that some specific revision(s) is being reverted,
and explain why. It should not describe the semantic code change that
results from the reversion, since that can be derived by consulting
the original log message and diff. (And if you're using a system in
which editing or annotating past log messages is possible, go back and
fix the original change's log message to mention the future
reversion.)
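As a rough illustration of what such log messages might look like, the following sketch formats them from the identifying revision(s). The function names and the exact wording are my own invention, not a convention required by any version control system.

```python
def merge_log_message(original_rev, source_branch, summary=""):
    """Log message for a merge: name the original change, don't repeat its log."""
    message = f"Merge {original_rev} from {source_branch}."
    if summary:
        # At most a one-sentence reminder of what the change did.
        message += f"\n\n{summary}"
    return message


def revert_log_message(reverted_revs, reason):
    """Log message for a reversion: say which change(s) and why, nothing more."""
    return f"Revert {', '.join(reverted_revs)}.\n\n{reason}"


if __name__ == "__main__":
    print(merge_log_message("r1729", "main", "Fixes the drawbridge hinge."))
    print(revert_log_message(["r1729"], "It broke the build on Windows."))
```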
All of the above implies that you should use a consistent syntax
for referring to changes. This is helpful not only in log messages,
but in emails, the bug tracker, and elsewhere. In Git and Mercurial,
the syntax is usually "commit c39fcac089" (where the commit hash code
on the right is long enough to be unique in the relevant context). In
Subversion, revision numbers are linearly incremented integers and the
standard syntax for, say, revision 1729 is "r1729" (a syntax you'll
see in some examples in this book). Other systems have their own
standard syntaxes for expressing the changeset name. Whatever the
appropriate syntax is for your system, encourage people to use it consistently when
referring to changes. Consistent expression of change names makes
project bookkeeping much easier (as we will see in and in ).
Since a lot of this bookkeeping may be done by developers who must
also use some different bookkeeping method for internal projects at
their company, it needs to be as easy as possible.
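One payoff of consistent syntax is that change references become easy to find mechanically. The sketch below recognizes the two syntaxes mentioned above; the regular expression and the 7-to-40-character hash length are assumptions chosen for this example, not a standard.

```python
import re

# "commit <hex hash>" (Git/Mercurial style) or "r<number>" (Subversion style).
CHANGE_REF = re.compile(r"\bcommit ([0-9a-f]{7,40})\b|\br(\d+)\b")


def find_change_refs(text):
    """Return (system, identifier) pairs for each change mentioned in text."""
    refs = []
    for match in CHANGE_REF.finditer(text):
        if match.group(1):
            refs.append(("hash", match.group(1)))
        else:
            refs.append(("revnum", int(match.group(2))))
    return refs


if __name__ == "__main__":
    print(find_change_refs("Merge commit c39fcac089 from main; reverts r1729."))
```

A bug tracker or mail archive could use something like this to turn every mention of a change into a link to the repository browser.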
See also
.
Authorization
Even if your project's version control system or hosting site
allows technical enforcement of developers' activity
areas (e.g., permitting them to push commits in some
places but not others), it's usually better not to
use it. Automated enforcement is rarely necessary, and may even be
harmful.
Instead, most projects use an honor system: when a person is
granted commit access, even for a sub-area of the project, what they
actually receive is the physical ability to commit anywhere in the
authoritative repository. They're just asked to keep their commits in their
area. (See for how projects
decide who can put changes where.)
Remember that there is little real risk here: the repository
provides an audit trail, and in an active project, all commits are
reviewed anyway. If someone commits where they're not supposed to,
others will notice it and say something. If a change needs to be
undone, that's simple enough — everything's under version control
anyway, so just revert.
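If you want a little help noticing such commits without blocking anyone, the audit trail itself can be scanned after the fact. This is only a sketch of a review aid, not an enforcement mechanism: the AREAS table, the committer names, and the commit record format are all hypothetical.

```python
# Hypothetical map from committer to the path prefixes of their agreed areas.
# An empty prefix matches every path, i.e., a full committer.
AREAS = {
    "alice": ["docs/"],
    "bob": ["src/net/", "docs/"],
    "carol": [""],
}


def out_of_area(commits, areas):
    """Report commits that touch paths outside the author's agreed area."""
    flagged = []
    for commit in commits:
        allowed = areas.get(commit["author"], [])
        stray = [
            path for path in commit["paths"]
            if not any(path.startswith(prefix) for prefix in allowed)
        ]
        if stray:
            flagged.append((commit["id"], commit["author"], stray))
    return flagged
```

The output is meant to start a friendly conversation, since an out-of-area commit may well have been approved by someone who works in that area.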
There are several advantages to this more relaxed approach.
First, as developers expand into other areas (which they usually will
if they stay with the project), there is no administrative overhead to
granting them wider privileges. Once the decision is made, the person
can just start committing in the new area right away.
Second, it allows such expansion to be done in a fine-grained manner.
Generally, a committer in area X who wants to expand to area Y will
start posting patches against Y and asking for review. If someone who
already has commit access to area Y sees such a patch and approves of
it, she can just tell the submitter to commit the change directly
(mentioning the approver's name in the log message, of
course). That way, the commit will come from the person who actually
wrote the change, which is preferable from both an information
management standpoint and from a crediting standpoint.
Last, and perhaps most important, using the honor system
encourages an atmosphere of trust and mutual respect. Giving someone
commit access to a subdomain is a statement about their technical
preparedness — it says: "We see you have expertise to make commits
in a certain domain, so go for it." But imposing strict authorization
controls says: "Not only are we asserting a limit on your expertise,
we're also a bit suspicious about
your intentions." That's not the sort of
statement you want to make if you can avoid it. Bringing someone into
the project as a committer is an opportunity to initiate them into a
circle of mutual trust. A good way to do that is to give them more
power than they're supposed to use, then inform them that it's up to
them to stay within agreed-on limits.
The Subversion project has operated this way, on the honor system, for
over two decades, with more than 50 full committers and over 100
partial committers as of this writing. (Not all of them are active at
any given time, but that just reinforces the point I'm making here.)
The only distinction the system enforces by technical means is the
global distinction between committers and everyone else. All further
subdivisions are maintained solely by human discretion. Yet the
project has never had a serious problem with someone deliberately
committing outside their domain. Once or twice there's been an
innocent misunderstanding about the extent of someone's commit
privileges, but it's always been resolved quickly and amiably.
Obviously, in situations where self-policing is impractical, you
must rely on hard authorization controls. But such situations are
rare. Even when there are millions of lines of code and hundreds or
thousands of developers, a commit to any given code module should
still be reviewed by those who work on that module (see
), and
they can recognize if someone committed there who wasn't supposed to.
If regular commit review isn't happening, then
the project has bigger problems to deal with than the authorization
system anyway.
In summary, don't spend too much time fiddling with
technically-enforced authorization controls unless you have a specific
reason to. It usually won't bring much tangible benefit, and there
are advantages to relying on human controls instead.
None of this should be taken to mean that the socially-enforced
restrictions themselves are unimportant, of course. It would be bad
for a project to encourage people to commit in areas where they're not
qualified. Furthermore, in many projects, full (project-wide) commit
permission has a special corollary status: it implies voting rights on
project-wide questions. This political aspect of commit areas is
discussed more in .
Receiving and Reviewing Contributions
These days the primary means by which
changes (code contributions, documentation
contributions, etc.) reach a project is via "pull
requests" (described in more detail below), though some older projects
still prefer to receive a patch posted to a mailing list or attached
in a bug tracker. Once a contribution arrives, it typically goes
through a review-and-revise process, involving communication between
the contributor and various members of the project. At some point
during the process, if all goes well, the contribution is eventually
deemed ready for incorporation into the main codebase and is merged
in. This does not mean that discussion and work on the contribution
cease at that point. The contribution may well continue to be
improved; it's just that the improvement now takes place within the
project rather than off to one side. The moment when a code change is
merged to the project's main branch is when it becomes officially
part of the project. It is no longer the sole responsibility of
whoever submitted it; it is the collective responsibility of the
project as a whole.
Pull Requests / Merge Requests
A pull request (also called a
merge request) is a request
from a contributor to the
project for a certain change to be "pulled" (i.e., merged) into the
project — usually into the project's main branch, though sometimes
pull requests are targeted at some other branch.
The change is offered in the form of the difference between the
contributor's copy (or "clone") of the project and the project's own
copy. The two copies share most of their change history, of course,
but at a certain point the contributor's diverges — it
contains the change the contributor has implemented and that the
project does not have yet. The project may also have moved on since
the clone was made and contain new changes that the contributor does
not have, but these can be ignored for the purposes of discussion
here. A pull request is directional: it is for sending changes the
contributor has that the receiver does not, and is not about changes
flowing in the reverse direction.
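The directionality can be expressed as a simple difference between two histories. This sketch models each history as a list of commit identifiers, oldest first, which is a simplification of how real version control systems represent history; the function name is made up for the example.

```python
def commits_to_pull(contributor_history, project_history):
    """Commits the contributor has that the project does not, oldest first."""
    already_in_project = set(project_history)
    return [c for c in contributor_history if c not in already_in_project]


if __name__ == "__main__":
    project = ["a1", "b2", "c3", "d4"]      # d4 arrived after the clone was made
    contributor = ["a1", "b2", "c3", "x9"]  # x9 is the proposed change
    print(commits_to_pull(contributor, project))
```

Swapping the arguments gives the changes that would flow in the reverse direction, which a pull request does not cover.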
In practice, the two copies are usually stored on the same
hosting site, and the contributor can initiate the pull request by
simply clicking a button. Creating a pull request automatically
creates a tracking ticket that everyone can see, so that a pending
pull request can use the same workflow as any other issue. Some
projects also have contributions enter through a collaborative code
review tool, such as https://en.wikipedia.org/wiki/Gerrit_%28software%29 or https://www.reviewboard.org/, and these days project hosting
sites include code-review features directly in their pull request
management interface anyway.
Pull requests are so frequent a topic of discussion that you
will often see people abbreviate them as "PR", as in "Yeah, your
proposed fix sounds good. Would you post a PR and assign it to me for
review please?" For newcomers, however, the term "pull request" is
sometimes confusing, because it sounds like it is a request by
the contributor to pull a change from someone else, when actually it
is a request the contributor makes to the project to pull the
change from the contributor. Some systems (e.g., GitLab) use the term
"merge request" to mean the same thing. I actually find that term
much more natural, but alas, "pull request", as popularized by GitHub,
appears to have won, and we all need to just get used to it. I'm not
bitter.
Commit Notifications / Commit Emails
Every commit to the repository — or every push
containing a group of commits — should generate a
notification that goes out to a subscribable forum, such as an email
sent to a mailing list. The notification should show who made the
change, when they made it, what files and directories changed, and the
actual content of the change.
The most common form of commit notifications is to just
subscribe to the repository itself, since the hosting platform will
send out notifications — usually by email, sometimes
also by other means — for interesting activity. Each
developer gets to customize what counts as interesting for them.
Alternatively, some projects have a mailing list dedicated to commit
notifications. Each commit (or push, or merge to the main branch)
sends an automatic email to that list. Note that this is a special
mailing list devoted to commit emails, separate from mailing lists to
which humans post. Whatever forms of commit notification your project
arranges, each notification should make it easy for developers to
proceed from there to reviewing that commit or changeset (see ).
Whether your project should use an email
list, either in addition to or instead of some
other kind of subscribable notifications, depends
on the demographics of your
developers, but when in doubt, email is usually a good default choice.
The specifics of setting up notifications vary depending on the
version control system, but usually there's a script or other packaged
facility for doing it. If you're having trouble finding it, try
looking for documentation on hooks (or
sometimes triggers), specifically a
post-merge hook or post-commit
hook. These hooks are a general means of launching
automated tasks in response to receiving changes. The hook is fed all
the information about the merge, and is then free to use that
information to do anything — for example, to send out an
email.
With pre-packaged commit email systems, you may want to
modify some of the default behaviors:
Some commit mailers don't include the actual diffs in the
email, but instead provide a URL to view the change on the web using
the repository browsing system. While it's good to provide the URL,
so the change can be referred to later, it is also important that
commit emails include
the diffs themselves. Reading email is already part of people's
routine, so if the content of the change is visible right there in
the commit email, developers will review the commit on the spot,
without leaving their mail reader. If they have to click on a URL to
review the change, most won't do it, because that requires a new
action instead of a continuation of what they were already doing.
Furthermore, if the reviewer wants to ask something about the
change, it's vastly easier to hit reply-with-text and simply
annotate the quoted diff than it is to visit a web page and
laboriously cut-and-paste parts of the diff from web browser to
email client.
Of course, if the diff is huge, such as when a large body of
new code has been added to the repository, then it makes sense to
omit the diff and offer only the URL. Most commit mailers can do
this kind of size-limiting automatically. If yours can't, then it's
still better to include diffs, and live with the occasional huge
email, than to leave the diffs off entirely. Convenient reviewing
and commenting is a cornerstone of cooperative development, and much
too important to do without.
The commit emails should set their Reply-to header
to the regular development list, not the commit email list. That
is, when someone reviews a commit and writes a response, their
response should be automatically directed toward the human
development list, where technical issues are normally discussed.
There are a few reasons for this. First, you want to keep all
technical discussion on one list, because that's where people expect
it to happen, and because that way there's only one archive to
search. Second, there might be interested parties not subscribed to
the commit email list. Third, the commit email list advertises
itself as a service for watching commits, not for watching commits
and having occasional technical discussions.
Those who subscribed to the commit email list did not sign up for
anything but commit emails; sending them other material via that
list would violate an implicit contract.
Note that this advice to set Reply-to does not contradict the
recommendations in
. It's
always okay for the sender of a message to set
Reply-to. In this case, the sender is the version control system
itself, and it sets Reply-to in order to indicate that the
appropriate place for replies is the development mailing list, not
the commit list.
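The behaviors recommended above (include the diff, fall back to the URL alone when the diff is huge, and direct replies to the development list) might look like this inside a hook script. This is a sketch under stated assumptions rather than any real commit mailer: the list addresses, the size cutoff, and the function name are all hypothetical, and the message is only built, not sent.

```python
from email.message import EmailMessage

MAX_DIFF_CHARS = 100_000  # hypothetical cutoff for "huge" diffs


def build_commit_email(author, date, commit_id, log, files, diff, url,
                       commit_list="commits@example.org",
                       dev_list="dev@example.org"):
    """Build (but do not send) a notification email for one commit."""
    msg = EmailMessage()
    msg["From"] = author
    msg["To"] = commit_list
    # Replies go to the human development list, not the commit list.
    msg["Reply-To"] = dev_list
    first_line = (log.splitlines() or [""])[0]
    msg["Subject"] = f"commit {commit_id[:10]}: {first_line}"

    if len(diff) > MAX_DIFF_CHARS:
        diff = "(diff omitted because of its size; see the URL above)"
    body = "\n".join([
        f"Author: {author}",
        f"Date:   {date}",
        f"URL:    {url}",
        "",
        log,
        "",
        "Changed paths:",
        *[f"  {path}" for path in files],
        "",
        diff,
    ])
    msg.set_content(body)
    return msg
```

A post-commit or post-merge hook would gather these values from the version control system and hand the resulting message to whatever mail delivery mechanism the server provides.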
Bug Tracker
Bug tracking is a broad topic, and various aspects of it are
discussed throughout this book. Here I'll concentrate mainly on the
features your project should look for in a bug tracker, and how to use
them. But to get to those, we have to start with a policy question:
exactly what kind of information should be kept in a bug
tracker anyway?
The term bug tracker is misleading. Bug
tracking systems are used to track not only bug reports, but new
feature requests, one-time tasks, unsolicited patches — really
anything that has distinct beginning and end states, with optional
transition states in between, and that accrues information over its
lifetime. For this reason, bug trackers are also called
issue trackers, ticket
trackers, defect trackers,
artifact trackers, request
trackers, etc.
In this book, I'll generally use the word
ticket to refer to the items in the tracker's
database, because that distinguishes between the behavior that the
user encountered or proposed — that is, the bug or
feature itself — and the tracker's ongoing
record of that discovery, diagnosis, discussion,
and eventual resolution. But note that many projects use the word
bug or issue to refer to
both the ticket itself and to the underlying behavior or goal that the
ticket is tracking. (Those usages are in fact more common than
"ticket"; it's just that in this book we need to be able to make this
distinction explicitly in a way that projects themselves usually
don't.)
The classic ticket life cycle looks like this:
1. Someone files the ticket. They provide a summary, an
   initial description (including a reproduction recipe, if
   applicable; see
   for
   how to encourage good bug reports), and whatever other
   information the tracker asks for. The person who files
   the ticket may be totally unknown to the project; bug
   reports and feature requests are as likely to come from
   the user community as from the developers.
   Once filed, the ticket is in what's called an
   open state. Because no action has
   been taken yet, some trackers also label it as
   unverified and/or
   unstarted. It is not assigned to
   anyone; or, in some systems, it is assigned to a fake
   user to represent the lack of real assignation. At this
   point, it is in a holding area: the ticket has been
   recorded, but not yet integrated into the project's
   consciousness.
2. Others read the ticket, add comments to it, and
   perhaps ask the original filer for clarification on some
   points.
3. The bug gets reproduced.
   This may be the most important moment in its
   life cycle. Although the bug is not actually fixed yet,
   the fact that someone besides the original filer was able
   to make it happen proves that it is genuine, and, no less
   importantly, confirms to the original filer that they've
   contributed to the project by reporting a real bug.
   (This step and some of the others don't apply to
   feature proposals, task tickets, etc., of course. But most
   filings are for genuine bugs, so we'll focus on that
   here.)
4. The bug gets diagnosed: its
   cause is identified, and if possible, the effort required
   to fix it is estimated. Make sure these things get
   recorded in the ticket; if the person who diagnosed the
   bug suddenly has to step away from it for a
   while, someone else should be able to pick up where she
   left off.
   In this stage, or sometimes in the previous one,
   a developer may "take ownership" of the ticket and
   assign it to herself (
   examines the assignment process in more detail). The ticket's
   priority may also be set at this
   stage. For example, if it is so important that it should
   delay the next release, that fact needs to be identified
   early, and the tracker should have some way of noting
   it.
5. The ticket gets scheduled for resolution.
   Scheduling doesn't necessarily mean naming a date by which
   it will be fixed. Sometimes it just means deciding which
   future release (not necessarily the next one) the bug
   should be fixed by, or deciding that it need not block any
   particular release. Scheduling may also be dispensed
   with if the bug is quick to fix.
6. The bug gets fixed (or the task completed, or
   the patch applied, or whatever). The change or set of
   changes that fixed it should be discoverable from
   the ticket. After this, the ticket is
   closed and/or marked as
   resolved.
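The life cycle above can be modeled as a small state machine. The sketch below is one possible reading of it, not the data model of any real tracker: the state names, the allowed transitions, and the Ticket class are assumptions made for illustration, and it covers bug tickets rather than feature proposals or tasks.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    REPRODUCED = "reproduced"
    DIAGNOSED = "diagnosed"
    SCHEDULED = "scheduled"
    CLOSED = "closed"


# Allowed moves. A ticket can be closed from any live state (for example
# when scheduling is skipped for a quick fix), and a closed ticket can be
# reopened.
TRANSITIONS = {
    State.OPEN: {State.REPRODUCED, State.CLOSED},
    State.REPRODUCED: {State.DIAGNOSED, State.CLOSED},
    State.DIAGNOSED: {State.SCHEDULED, State.CLOSED},
    State.SCHEDULED: {State.CLOSED},
    State.CLOSED: {State.OPEN},
}


class Ticket:
    def __init__(self, summary):
        self.summary = summary
        self.state = State.OPEN
        self.assignee = None   # unassigned until someone takes ownership
        self.fixed_by = []     # change identifiers, e.g. "commit c39fcac089"
        self.history = [(State.OPEN, "filed")]

    def move(self, new_state, note=""):
        """Change state, refusing transitions the life cycle doesn't allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append((new_state, note))

    def close_as_fixed(self, change_ids, note="fixed"):
        """Close the ticket and record which change(s) fixed it."""
        self.move(State.CLOSED, note)
        self.fixed_by.extend(change_ids)
```

Keeping the history list alongside the current state reflects the point made earlier: a ticket accrues information over its lifetime, and the fixing change should be discoverable from it.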
There are some common variations on this life cycle. Often
a ticket is closed very soon after being filed, because it turns out
not to be a bug at all, but rather a misunderstanding on the part of
the user. As a project acquires more users, more and more such
invalid tickets will come in, and developers will close them with
increasingly short-tempered responses. Try to guard against the
latter tendency. It does no one any good, as the individual user in
each case is not responsible for all the previous invalid tickets; the
statistical trend is visible only from the developers' point of view,
not from the user's. (In
we'll look at
techniques for reducing the number of invalid tickets.) Also, if
different users are experiencing the same misunderstanding over and
over, it might mean that some aspect of the software needs to be
redesigned. This sort of pattern is easiest to notice when there is
a dedicated issue manager monitoring the bug database; see
.
Another common life event is for the ticket to be closed
as a duplicate soon after Step 1. A duplicate
is when someone reports something that's already known to the project.
Duplicates are not confined to open tickets: it's possible for a bug to
come back after having been fixed (this is known as a
regression), in which case a reasonable course
is to reopen the original ticket and close any new reports as
duplicates of the original one. The bug tracking software keeps
track of this relationship bidirectionally, so that reproduction
information in the duplicates is available to the original ticket, and
vice versa.
A third variation is for the developers to close the ticket,
thinking they have fixed it, only to have the original reporter reject
the fix and reopen it. This is usually because the developers simply
don't have access to the environment necessary to reproduce the bug,
or because they didn't test the fix using the exact same reproduction
recipe as the reporter.
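The life cycle and its variations can be pictured as a small state machine. The following Python sketch is only an illustration: the state names, the transition table, and the Ticket class are assumptions made for this example, not the data model of any particular tracker.

```python
from enum import Enum


class State(Enum):
    FILED = "filed"
    TRIAGED = "triaged"      # read, given an owner and a priority
    SCHEDULED = "scheduled"  # tied to a release (an optional step)
    RESOLVED = "resolved"
    INVALID = "invalid"      # not a bug after all
    DUPLICATE = "duplicate"  # already known to the project


# Hypothetical transition table: which states may follow which.
ALLOWED = {
    State.FILED: {State.TRIAGED, State.INVALID, State.DUPLICATE},
    State.TRIAGED: {State.SCHEDULED, State.RESOLVED,
                    State.INVALID, State.DUPLICATE},
    State.SCHEDULED: {State.RESOLVED, State.INVALID, State.DUPLICATE},
    State.RESOLVED: {State.TRIAGED},  # reopened: regression or rejected fix
    State.INVALID: {State.TRIAGED},   # reopened with better information
    State.DUPLICATE: set(),           # discussion continues on the original
}


class Ticket:
    def __init__(self, ticket_id):
        self.id = ticket_id
        self.state = State.FILED
        self.history = [State.FILED]
        self.duplicate_of = None
        self.duplicates = []

    def move_to(self, new_state):
        """Change state, refusing transitions the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        self.history.append(new_state)

    def close_as_duplicate_of(self, original):
        """Close this ticket and record the link in both directions."""
        self.move_to(State.DUPLICATE)
        self.duplicate_of = original
        original.duplicates.append(self)
```

In this sketch a quick fix can skip scheduling (TRIAGED straight to RESOLVED), and a regression is handled by reopening the original ticket and closing new reports as duplicates of it.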
Aside from these variations, there may be other small details of
the life cycle that vary depending on the tracking software. But the
basic shape is the same, and while the life cycle itself is not
specific to open source software, it has implications for how open
source projects use their bug trackers.
The tracker is as much a public face of the project as the repository,
mailing lists or web pages. (Indeed, the bug tracker is actually the
first place to look, even before the repository, when you're trying to
evaluate a project's overall health.) Anyone may file a ticket, anyone may look
at a ticket, and anyone may browse the list of currently open tickets.
It follows that you never know how many people are waiting to see
progress on a given ticket. While the size and skill of the
development community constrains the rate at which tickets can be
resolved, the project should at least try to acknowledge each ticket
the moment it appears. Even if the ticket lingers for a while, a
response encourages the reporter to stay involved, because she feels
that a human has registered what she has done (remember that filing a
ticket usually involves more effort than, say, posting an email).
Furthermore, once a ticket is seen by a developer, it enters the
project's consciousness, in the sense that the developer can be on the
lookout for other instances of the ticket, can talk about it with
other developers, etc.
This centrality to the life of the project implies a few things
about trackers' technical features:
The tracker should be connected to email, such that
every change to a ticket, including its initial filing, causes a
notification mail to go out to some set of appropriate
recipients. See "Interaction with Email"
later in this chapter for more on this.
The form for filing tickets should have a place to record
the reporter's email address or other contact information, so she
can be contacted for more details. (For logged-in
users whom the system already knows, these details are
automatically filled in, of course.)
But if possible, it should not
require the reporter's email address or real
identity, as some people prefer to report anonymously.
The tracker should have APIs. I cannot stress the
importance of this enough. If there is no way to interact with
the tracker programmatically, then in the long run there is no way
to interact with it scalably. APIs provide a route to customizing
the behavior of the tracker by, in effect, expanding it to include
third-party software. Instead of being just the specific ticket
tracking software running on a server somewhere, it's that
software plus whatever custom behaviors your
project implements elsewhere and plugs in to the tracker via the
APIs.
Also, if your project uses a proprietary ticket tracker,
as is becoming more common now that so many projects host their
code on proprietary canned hosting sites and thus use that
site's built-in tracker, APIs provide a way to avoid being
locked in to that hosting platform. You can, in theory, take the
ticket history with you if you choose to go somewhere else (you
may never exercise this option, but think of it as
insurance — and some projects have actually done
it).
Fortunately, the ticket trackers of most major hosting
sites have APIs.
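As an illustration of the "insurance" that an API provides, here is a minimal Python sketch of exporting a project's whole ticket history. It assumes a hypothetical tracker whose API returns tickets one page at a time; the fetch_page callable, the page numbering, and the ticket fields are all invented for this example, so a real exporter would replace fake_fetch with calls to the actual API of the tracker in question.

```python
import io
import json
from typing import Callable, Dict, Iterator, List, TextIO


def iter_tickets(fetch_page: Callable[[int], List[Dict]]) -> Iterator[Dict]:
    """Yield every ticket, requesting pages 1, 2, 3... until one is empty."""
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch
        page += 1


def export_tickets(fetch_page: Callable[[int], List[Dict]],
                   out: TextIO) -> int:
    """Write one JSON object per line to `out`; return the ticket count."""
    count = 0
    for ticket in iter_tickets(fetch_page):
        out.write(json.dumps(ticket, sort_keys=True) + "\n")
        count += 1
    return count


# Stand-in for real HTTP requests (hypothetical data).
FAKE_PAGES = {
    1: [{"id": 1, "title": "Crash on startup", "state": "closed"}],
    2: [{"id": 2, "title": "Typo in manual", "state": "open"}],
}


def fake_fetch(page: int) -> List[Dict]:
    return FAKE_PAGES.get(page, [])


if __name__ == "__main__":
    buffer = io.StringIO()
    total = export_tickets(fake_fetch, buffer)
    print(f"exported {total} tickets")
```

Keeping the export format simple (one JSON object per line) makes it easier to write an importer for whatever tracker the project moves to later.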
Interaction with Email
Most trackers now have at least decent email integration
features: at a minimum, the ability to create new tickets by email,
the ability to "subscribe" to a ticket to receive
emails about activity on that ticket, and the ability to add new
comments to a ticket by email. Some trackers even allow one to
manipulate ticket state (e.g., change the status field, the assignee,
etc.) by email, and for people who use the tracker a
lot, such as an issue manager, that can make a
huge difference in their ability to stay on top of tracker activity
and keep things organized.
The tracker email feature that is likely to be used by everyone,
though, is simply the ability to read a ticket's activity by email and
respond by email. This is a valuable time-saver for many people in
the project, since it makes it easy to integrate bug traffic into
one's daily email flow. But don't let this integration give
anyone the illusion that the total collection of bug tickets and their
email traffic is the equivalent of the development mailing list. It's
not, and it is important to understand and manage the
difference.
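To make "adding a comment by email" concrete, the following Python sketch turns an incoming mail message into a ticket comment. The "[ticket #N]" subject tag and the shape of the returned dictionary are assumptions made for this example; each real tracker has its own conventions for matching replies to tickets.

```python
import re
from email import message_from_string, policy

# Hypothetical convention: notification subjects carry "[ticket #123]".
TICKET_TAG = re.compile(r"\[ticket #(\d+)\]")


def comment_from_email(raw_message):
    """Return a comment dict for the ticket named in the subject,
    or None if the message does not refer to any ticket."""
    msg = message_from_string(raw_message, policy=policy.default)
    match = TICKET_TAG.search(msg["Subject"] or "")
    if match is None:
        return None
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return {
        "ticket": int(match.group(1)),
        "author": str(msg["From"]),
        "text": text.strip(),
    }
```

A reply whose subject lacks the tag is simply not treated as a comment, which is one reason notification mails usually put the ticket number in the subject line.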
Pre-Filtering the Bug Tracker
Most ticket databases eventually suffer from the same problem: a
crushing load of duplicate or invalid tickets filed by well-meaning but
inexperienced or ill-informed users. The first step in combating
this trend is usually to put a prominent notice on the front page of
the bug tracker, explaining how to tell if a bug is really a bug, how
to search to see if it's already been reported, and finally, how to
effectively report it if one still thinks it's a new bug.
This will reduce the noise level for a while, but as the number
of users increases, the problem will eventually come back. No
individual user can be blamed for it. Each one is just trying to
contribute to the project's well-being, and even if their first bug
report isn't helpful, you still want to encourage them to stay
involved and file better tickets in the future. In the meantime,
though, the project needs to keep the ticket database as free of junk
as possible.
The two things that will do the most to prevent this problem
are: making sure there are people watching the bug tracker who have
enough knowledge to close tickets as invalid or duplicates the moment
they come in, and requiring (or strongly encouraging) users to confirm
their bugs with other people before filing them
in the tracker.
The first technique seems to be used universally. Even projects
with huge ticket databases (say, the Debian bug tracker at
https://bugs.debian.org/, which
contained 996,003 tickets as of this writing) still arrange things so that
someone sees each ticket that comes in. It may be
a different person depending on the category of the ticket. For
example, the Debian project is a collection of software packages, so
Debian automatically routes each ticket to the appropriate package
maintainers. Of course, users can sometimes misidentify a ticket's
category, with the result that the ticket is sent to the wrong person
initially, who may then have to reroute it. However, the important
thing is that the burden is still shared — whether the user
guesses right or wrong when filing, ticket watching is still
distributed more or less evenly among the developers, so each ticket is
able to receive a timely response.
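The routing idea can be sketched in a few lines of Python. This is not how Debian's tracker is implemented; the package names, addresses, and the fallback triage list below are hypothetical, and serve only to show the burden being spread by category, with rerouting when the reporter guessed wrong.

```python
from typing import Dict, List

# Hypothetical mapping from package (category) to its maintainers.
MAINTAINERS: Dict[str, List[str]] = {
    "editor": ["alice@example.org"],
    "compiler": ["bob@example.org", "carol@example.org"],
}

# Who sees tickets filed against an unknown or missing category.
TRIAGE_TEAM: List[str] = ["triage@example.org"]


def watchers_for(ticket: dict) -> List[str]:
    """Return the people who should be notified about this ticket."""
    return MAINTAINERS.get(ticket.get("package"), TRIAGE_TEAM)


def reroute(ticket: dict, new_package: str) -> List[str]:
    """Fix a misidentified category and return the new watchers."""
    ticket["package"] = new_package
    return watchers_for(ticket)
```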
The second technique is less widespread, probably because it's
harder to automate. The essential idea is that every new ticket gets
"buddied" into the database. When a user thinks he's found a problem,
he is asked to describe it on one of the mailing lists, or in a chat
room, and get confirmation from someone that it is indeed a bug.
Bringing in that second pair of eyes early can prevent a lot of
spurious reports. Sometimes the second party is able to identify that
the behavior is not a bug, or is fixed in recent releases. Or she may
be familiar with the symptoms from a previous ticket, and can prevent a
duplicate filing by pointing the user to the older ticket. Often it's
enough just to ask the user "Did you search the bug tracker to see if
it's already been reported?" Many people simply don't think of that,
yet are happy to do the search once they know someone's
expecting them to.
The buddy system can really keep the ticket database clean, but
it has some disadvantages too. Many people will file solo anyway,
either through not seeing or through disregarding the instructions
to find a buddy for new tickets. Thus it is still necessary for
some experienced participants to watch the ticket database.
Furthermore, because most new
reporters don't understand how difficult the task of maintaining the
ticket database is, it's not fair to chide them too harshly for
ignoring the guidelines. The watchers must be vigilant,
yet exercise restraint in how they bounce unbuddied tickets back to
their reporters. The goal is to train each reporter to use the
buddying system in the future, so that there is an ever-growing pool
of people who understand the ticket-filtering system. On seeing an
unbuddied ticket, the ideal steps are:
Immediately respond to the ticket, politely thanking the user
for filing, but pointing them to the buddying guidelines
(which should, of course, be prominently posted on the web
site).
If the ticket is clearly valid and not a duplicate, approve it
anyway, and start it down the normal life cycle. After all,
the reporter's now been informed about buddying, so there's
no point closing a valid ticket and wasting the work done so
far.
Otherwise, if the ticket is not clearly valid, close it, but
ask the reporter to reopen it if they get confirmation from
a buddy. When they do, they should put a reference to the
confirmation thread (e.g., a URL into the mailing list
archives).
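The three steps above amount to a small decision procedure, sketched here in Python. The function name and the wording of the actions are invented for this example; in practice the judgment about whether a ticket is "clearly valid" is made by a human watcher, not by code.

```python
from typing import List


def handle_unbuddied(clearly_valid: bool, duplicate: bool) -> List[str]:
    """Return the actions a watcher takes on seeing an unbuddied ticket."""
    # Step 1 always happens, whatever the ticket's merits.
    actions = ["thank the reporter and point to the buddying guidelines"]
    if clearly_valid and not duplicate:
        # Step 2: don't waste the work already done.
        actions.append("approve the ticket and start the normal life cycle")
    else:
        # Step 3: bounce it back, leaving the door open.
        actions.append(
            "close the ticket and ask the reporter to reopen it with a "
            "reference to the confirmation thread")
    return actions
```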
Remember that although this system will improve the signal/noise
ratio in the ticket database over time, it will never completely stop
the misfilings. The only way to prevent misfilings entirely is to
close off the bug tracker to everyone but
developers — a cure that is almost always worse than
the disease. It's better to accept that cleaning out invalid tickets
will always be part of the project's routine maintenance, and to try
to get as many people as possible to help.
Real-Time Chat Systems
Many projects offer real-time chat rooms in which developers can
have fast-turnaround conversations with each other and with users.
Such conversations often precede a bug report or some other kind of
more formal, tracked contribution.
For decades, the standard real-time chat system for open source
projects was Internet Relay Chat
(IRC), which predates the World Wide Web and
uses a text-based interface and command language. Starting around
2014-2015, a number of open source projects began trying out newer,
web-browser-friendly chat systems, in particular the open source
platforms https://zulip.org/,
https://mattermost.org/,
https://rocket.chat/, and
the Matrix protocol. (Matrix is actually a protocol and an open
source reference implementation. The protocol is supported by an
increasing number of chat applications, including IRC as well as more
modern systems. In the words of Julian Foad in https://issues.apache.org/jira/browse/SVN-525#comment-17286477,
"Matrix is a 'spiritual successor' to IRC, and truly Open, federated,
and standardized. ... In my opinion Matrix is very much the Right Way
forward for all sorts of reasons." For more information, see https://matrix.org/ and https://en.wikipedia.org/wiki/Matrix_(protocol).)
A few projects also experimented with the proprietary
online chat service Slack when it was new, but Slack hasn't been
widely adopted by open source projects and I wouldn't recommend it for
them. In a post written when that early experimentation was still
under way, Drew DeVault lists some of the reasons why Slack isn't
suitable: https://drewdevault.com/2015/11/01/Please-stop-using-slack.html.
I don't know whether any of these new systems will emerge as the
long-term default choice for open source projects. Try looking at the
open source chat systems used by similar projects and use that as
guidance in choosing yours. Matrix compatibility (sometimes referred
to as Matrix "bridging" or having a "Matrix bridge") is a good
property to keep in mind, and if possible IRC bridging too, since some
developers still like to use their IRC clients with non-IRC server
applications.
Chat Rooms and Growth
A chat server is usually divided into virtual chat
rooms. The chat application may call these "channels", or
"streams", or something else, but the concept is generally the same: a
chat room is a shared space in which everyone who is in that room can
see every message posted to the room. Every project maintains a
certain set of advertised, topic-specific public rooms; these are the
entry points into chat for new participants. (When two
or a few users wish to chat privately, it is sometimes said that they
create a "private room". Such rooms are usually
temporary.) Some projects maintain a "welcome" or
"general" room specifically for newcomers to start out in, with
current project members watching that room in order to greet new
arrivals, but it's also fine to just have new people come directly
into the regular rooms to ask their questions too.
Exactly how many rooms to have, and for what topics, will depend
on your project, but it's best to start out with a small number of
rooms — even just one — and only add
more when it becomes clearly necessary. Much of the value of
real-time chat comes from people being together in the same rooms and
serendipitously seeing conversations between others.
Nick-Flagging and Notifications
Users who are new to such chat systems usually need some time to
learn the conventions of real-time written communications. While each
project has its own local customs, there is at least one convention
that seems to be common in almost all projects:
nick-flagging for notification.
A user's nick is their nickname, their
handle in the chat system. It might or might not be some form of
their real name, but in any case it is how they are identified in
chat. When you want to speak to that person, you prefix your message
with her handle (perhaps followed by a separator character such as a
colon). Her chat client, upon seeing her handle used in a message,
notifies her by whatever means she has
configured — perhaps by flashing a notification popup
on her screen (even when she does not have the chat window in front of
her right then), or perhaps via an audible signal.
This notification only happens for messages that contain her
handle, not for other messages. She may still see those other
messages go by if she happens to be in that chat room right
then — developers often "lurk" in a chat room just to
see what's going on — but thanks to nick-flagging she
can easily tell the difference between messages addressed to her and
other messages. A message can contain multiple nicks, of course, in
which case each of the corresponding people would be notified.
The ability for users to separate the conversations they are
involved in from other conversations is key to successful use of
real-time chat in open source projects. It is how a large number of
developers can be in a "room" and all talk "together" without getting
their different streams of conversation entangled. Each developer can
tell which messages are specifically requesting her attention and
which ones are not. It is analogous to an observation Deaf people
sometimes make about the advantage of communicating with sign language
instead of spoken language in a crowded room: as long as you have a
clear line of sight to your interlocutor, the "noisiness" of the room
(whether with signed or spoken language) does not interfere much with
your ability to maintain the conversation. Similarly, a chat room can
be very busy, but as long as everyone follows the convention of
nick-flagging, people can simultaneously participate in their own
chats and keep an eye on whatever else they're interested in, at least
to the limit of their attentional capacity. (See http://www.rants.org/2013/01/09/the-irc-curmudgeon/ for a
more detailed examination of nick-flagging and some
examples.)
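The client-side half of nick-flagging is simple enough to sketch in Python. The set of characters allowed in a nick and the case-insensitive matching are assumptions made for this example; real chat clients apply their own rules.

```python
import re

# Assumption: nicks are made of letters, digits, underscores and hyphens.
NICK_TOKEN = re.compile(r"[\w\-]+")


def flagged_nicks(message, room_members):
    """Return the room members whose nick appears as a whole word
    in the message, in alphabetical order."""
    tokens = {token.lower() for token in NICK_TOKEN.findall(message)}
    return sorted(nick for nick in room_members if nick.lower() in tokens)
```

Given the members jrandom, alice, and bob, the message "jrandom, alice: the build is green again" flags jrandom and alice but not bob, while "some random chatter" flags nobody, because "random" is not the whole nick "jrandom".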
Paste Rooms and Paste Sites
Normally, the fact that a chat room is a shared space is a good
thing, as it allows people to jump into a conversation when they think
they have something to contribute, and allows spectators to learn by
watching. But it becomes problematic when someone has to provide a
large quantity of information at once, such as a large error message
or a transcript from a debugging session, because pasting too many
lines of output into the room may disrupt other conversations.
One solution is to have a dedicated chat room just for pastes.
The user posts their transcript there, then grabs the URL to that
specific message and posts the
URL in the original chat room, nick-flagging whoever should see
it. (Every message posted in an online chat
has its own unique URL permalink, just as every comment in, say, a bug
ticket does.)
Another solution is to set up a
pastebin site, which is separate from the chat
service but operates essentially as described above: the user posts their
transcript to the paste site to create a new paste,
which in turn has its own unique URL, which the user then presents
back in the chat room. Historically there have also been many public
pastebin sites, so you might not need to set up a dedicated one for
your project, but note that public pastebin sites tend to be
short-lived (my guess is that they get spammed a lot and end up being
expensive to maintain). As of this writing in early 2022, https://hastebin.com/ is up and
running. If you do need to set up your own, there are many open
source codebases available (including the code that backs hastebin:
see https://hastebin.com/about.md).
Chat Bots
Chat rooms can have non-human members too, so-called
bots, that provide automated services such as
answering frequently-asked questions. Typically, a bot is addressed
just like any other member of the channel, that is, commands are
delivered by "speaking to" the bot. No special server privileges are
required to run a bot. A bot is just like any other user joining a
channel.
People who spend enough time in chat learn how to manipulate
these bots and use them to help others. For example, when one user
comes into a room and asks a common question, another more experienced
user may issue a terse command to the local bot telling it to provide
that user with a specific detailed answer that the bot has been
previously told to remember.
If your chat rooms tend to get the same questions over and over,
I highly recommend setting up a bot. Only a small percentage of
channel users will acquire the expertise needed to manipulate the bot,
but those users will answer a disproportionately high percentage of
questions, because the bot enables them to respond so much more
efficiently. The exact command set and behaviors will differ among
bot implementations; unfortunately, the diversity of bot command
languages seems to be rivaled only by the diversity of wiki
syntaxes.
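As a rough illustration of the kind of bot described here, the following Python sketch remembers answers and repeats them on request. The "learn" and "tell ... about" commands are invented for this example and are not the command language of any existing bot; connecting the class to a real chat server is also left out.

```python
from typing import Dict, Optional


class FactoidBot:
    """Remembers named answers and delivers them to a given user."""

    def __init__(self, nick: str) -> None:
        self.nick = nick
        self.factoids: Dict[str, str] = {}

    def handle(self, sender: str, message: str) -> Optional[str]:
        """Return the bot's reply, or None if it has nothing to say."""
        prefix = self.nick + ":"
        if not message.startswith(prefix):
            return None  # the message was not addressed to the bot
        words = message[len(prefix):].split()
        # "learn KEY TEXT...": remember TEXT under KEY.
        if len(words) >= 3 and words[0] == "learn":
            key = words[1]
            self.factoids[key] = " ".join(words[2:])
            return f"{sender}: ok, I will remember '{key}'"
        # "tell NICK about KEY": nick-flag NICK with the stored answer.
        if len(words) == 4 and words[0] == "tell" and words[2] == "about":
            target, key = words[1], words[3]
            if key in self.factoids:
                return f"{target}: {self.factoids[key]}"
            return f"{sender}: I don't know anything about '{key}'"
        return None
```

An experienced user would teach the bot once ("helper: learn build ...") and from then on answer newcomers with a terse "helper: tell newbie about build".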
Commit Notifications in Chat
One particular kind of bot (also known as an "integration")
watches the project's version control repository and broadcasts commit
activity to the relevant chat rooms as it happens. While this offers
less technical utility than subscription-based commit notifications,
since interested
observers might or might not be around when a particular commit pops
up in the room, it is of immense social utility.
It gives people the sense of being part of something alive and
active — they see progress happening right before
their eyes. Because the notifications appear in a shared space,
people in the chat room will often react in real time, congratulating
the committer, or asking a question related to the commit, or even
reviewing the commit and commenting on it on the spot.
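What such a bot posts is usually a one-line summary per commit. The Python sketch below shows only that formatting step, under the assumption that the repository hook hands over a dictionary with "repo", "author", "id", and "message" keys; those field names are hypothetical, and real hosting sites each define their own payload format.

```python
def format_commit_notice(commit: dict, max_summary: int = 72) -> str:
    """Build a one-line chat message from a commit description."""
    lines = commit["message"].splitlines()
    # Only the first line of the log message goes to the chat room.
    summary = lines[0] if lines else "(no log message)"
    if len(summary) > max_summary:
        summary = summary[: max_summary - 3] + "..."
    short_id = commit["id"][:8]
    return (f"[{commit['repo']}] {commit['author']} "
            f"committed {short_id}: {summary}")
```

Keeping the notice to a single short line matters for the same reason that large pastes are discouraged: the room is a shared space, and anyone who wants the full change can follow the commit identifier.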
The technical details up of setting this up are beyond the scope
of this book, but I recommend learning how to enable it in your
project's chat platform. It's worth the effort. Most of the major
hosting sites make this integration fairly easy to set up. In
addition to "integration", some key words to try in a search are
"hook", "trigger", and "extension".
Wikis
A well-run wiki can be a wonderful thing for users and
developers. Wikis offer the lowest possible barrier-to-entry for
those seeking to contribute to the project. You just click and
edit — the wiki software will keep track of the
change, make sure you get credited, notify anyone who needs to be
notified, and immediately publish the new content to the world.
However, wikis also require some centralized effort to maintain.
When open source software project wikis go bad, they usually go bad
for the same reasons: lack of consistent organization and editing
(leading to a mess of outdated and redundant pages) and lack of clarity
on who the target audience is for a given page or section.
From the outset, try to have a clear page organization strategy
and even a pleasing visual layout, so that visitors (i.e., potential
editors) will instinctively know how to fit their contributions in.
Make sure the intended audience is clear at all times to all editors.
Most importantly, document these standards in the wiki itself and
point people to them, so editors have somewhere to go for guidance.
Too often, wiki administrators fall victim to the fantasy that because
hordes of visitors are individually adding high quality content to the
site, the sum of all these contributions must therefore also be of
high quality. That's not how collaborative editing works. Each
individual page or paragraph may be good when considered by itself,
but it will not be good if embedded in a disorganized or confusing
whole.
In general, wikis will amplify any failings that are present
from early on, since contributors tend to imitate whatever patterns
they see in front of them. So don't just set up the wiki and hope
everything falls into place. Prime it with well-written content, so
people have a template to follow.
The shining example of a well-run wiki is Wikipedia, of course,
but in many ways it's also a poor example because it gets so much more
editorial attention than any other wiki in the world. Still, if you
examine Wikipedia closely, you'll see that its administrators laid a
very thorough foundation for cooperation. There
is extensive documentation on how to write new entries, how to
maintain an appropriate point of view, what sorts of edits to make,
what edits to avoid, a dispute resolution process for contested edits
(involving several stages, including eventual arbitration), and so
forth. It also has authorization controls, so that if a page is
the target of repeated inappropriate edits, senior editors can lock it down
until the problem is resolved. In other words, they didn't just throw
some templates onto a web site and hope for the best. Wikipedia works
because its editors give careful thought to getting thousands of
strangers to tailor their writing to a common vision. While you may
not need the same level of preparedness to run a wiki for a free
software project, the spirit is worth emulating.
Wikis and Spam
Never allow open, anonymous editing on your wiki. The days when
that was possible are long gone now; today, any
open wiki other than Wikipedia will be covered completely with spam in
approximately 3 milliseconds. (Wikipedia is an exception only because it
has an unusually large number of editors willing to clean up spam
quickly, and because it has a well-funded organization behind it
devoted to fighting spam using various large-scale monitoring
techniques not practically available to smaller projects.)
All edits in your project's wiki should come from registered
users; if your wiki software doesn't already enforce this by default,
then configure it to enforce that. Even then you may need to keep
watch for spam edits from users who registered under false pretenses
for the purpose of spamming. (You may be able to allow
editing by non-registered users if you put some spam countermeasures
in place. For example, the Emacs Wiki (https://www.emacswiki.org/)
allows editing by anyone, but to submit your edit you must answer a
question that a bot is unlikely to be able to answer
accurately.)
Choosing a Wiki
If your project is on GitHub or some other free hosting site,
it's usually best to use the built-in wiki feature that most such
sites offer. That way your wiki will be automatically integrated with
your repository or other project permissions, and you can rely on the
site's user account system instead of having a separate registration
system for the wiki.
If you are setting up your own wiki, then you're free to choose
which one, and fortunately there are plenty of good free software wiki
implementations available. I've had good experience with DokuWiki
(https://www.dokuwiki.org/dokuwiki), but there are many others. There is
a wonderful tool called the Wiki Choice Wizard at http://www.wikimatrix.org/ that allows
you to specify the features you care about (an open source license can
be one of them) and then view a chart comparing all the wiki software
that meets those criteria. Another good resource is Wikipedia's own
page comparing different wikis: https://en.wikipedia.org/wiki/Comparison_of_wiki_software.
I do not recommend using MediaWiki (https://www.mediawiki.org) as the wiki
software for most projects. MediaWiki is the software on which
Wikipedia itself runs, and while it is very good at that, its
administrative facilities are tuned to the needs of a site unlike any
other wiki on the Net — and actually not so well-tuned
to the needs of smaller editing communities. Many projects are
tempted to choose MediaWiki because they think it will be easier for
users who already know its editing syntax from having edited at
Wikipedia, but this turns out to be an almost non-existent advantage
for several reasons. First, wikis in general, including Wikipedia,
are tending toward rich-text in-browser editing anyway, so that no one
really needs to learn the underlying wiki syntax unless they aim to be
a power user. Second, many other wikis offer a MediaWiki-syntax
plugin, so you can have that syntax anyway if you really want it.
Third, for those who will use a plaintext syntax instead of rich-text
editing, it's better to use a standardized generic markup format like
Markdown (https://daringfireball.net/projects/markdown/), which is available in
many wikis either natively or via a plugin, than to use any flavor of
wiki syntax. If you support Markdown, then people can edit in your
wiki using the same markup syntax they already know from GitHub and
other popular tools.
Translation Infrastructure
Various online platforms now exist to help automate the
organization and integration of human-language translation work in
open source projects. "Translation work" here means translating not just the
software's documentation, but also its
run-time user interface, error messages, etc., into different languages,
so that each user can interact with the software in their preferred
language.
It is not strictly necessary to use a separate translation
platform at all. Your translators could work directly in the
project's repository, like any other developer. But because
translation is a specialized skill, and translators' methods are
basically the same from project to project, the process is quite
amenable to being made more efficient through the use of dedicated
tools. Web-based translation platforms make it easier for translators
to get involved by removing the requirement that a translator (who may
have linguistic expertise but not development expertise) be
comfortable with the project's development tools, and by providing a
working environment that is specially optimized for translation rather
than for general code development.
Until 2013, the obvious recommendation for a platform would have
been https://transifex.com/, which was both the premier software
translation site and was open source software itself. However, its
main corporate sponsors switched to a closed, proprietary version in
March 2013, and development of the open source
version stopped then (see https://github.com/transifex/transifex-old-core/issues/206#issuecomment-15243207
for more). Transifex still offers zero-cost service for
open source projects, as does a competing proprietary platform called
Lokalise. But your translators may prefer to invest their time in
learning a fully open source platform, and there are several to choose
from: https://weblate.org/,
http://zanata.org/, https://translatewiki.net/,
and https://translations.launchpad.net/ (and there are probably
others I don't know about, so look around and ask in other translation
communities).
Internationalization (i18n) and Localization (l10n)
The process of adapting software user interfaces for different
groups of humans involves two terms that are easily confused:
"internationalization" and "localization".
Internationalization refers to the
process of putting software source code into a form that allows the
program to be translated (or "localized" — see below).
It includes, among other things, marking all user-visible strings
(interface texts, error messages, etc) so that they can be
automatically replaced by translated versions when the software is
deployed in a "locale". The translations are supplied by humans, but
internationalization is what allows those translations to be
automatically integrated into the software.
Thus, internationalization does not involve performing any
actual translation. Rather, it's about putting the program into a
form that allows translators, or "localizers", to get to work.
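A small Python example may make the division of labor clearer. It uses the standard gettext module's NullTranslations class as a base, but replaces the usual compiled message catalog files with an in-memory dictionary; the DictTranslations class, the CATALOGS table, and the sample Spanish string are assumptions made for this sketch, not how a real project would store its translations.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Toy catalog: looks messages up in a dict instead of a .mo file."""

    def __init__(self, entries):
        super().__init__()
        self._entries = entries

    def gettext(self, message):
        # Fall back to the original string when no translation exists.
        return self._entries.get(message, message)


# Localization: the translations themselves, supplied by humans.
CATALOGS = {
    "es": DictTranslations({"File not found": "Archivo no encontrado"}),
}


def translator_for(language):
    """Return a lookup function for the given language code."""
    catalog = CATALOGS.get(language, gettext.NullTranslations())
    return catalog.gettext


def report_missing_file(_):
    # Internationalization: the user-visible string is marked by being
    # passed through _(), so it can be swapped out at run time.
    return _("File not found")
```

Here report_missing_file is the internationalized code: it never changes when a new language is added. Adding a language means adding a catalog, which is the localizers' work.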
i18n is a common abbreviation for
"internationalization", since the word is so long to type. The "18"
refers to the number of letters between the initial "i" and the final
"n".
Localization, meanwhile, refers to
supplying an actual translation into a specific language, as well as
to other changes needed for that audience (for example, conversion of
measurement units, monetary units, etc). Because it may involve more
than just language change, the term is "localization" rather than
"translation", and the destination — the intended
audience — is called a locale.
A locale does not always correspond to a geographic area or a political
grouping. Localizing a program for Yiddish, for example, doesn't say
anything about where it will be run nor by whom, other than that they
know Yiddish.
l10n is likewise a common abbreviation
for "localization", using the same scheme as "i18n".
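The abbreviation scheme is mechanical: keep the first and last letters and replace everything between them with a count. A short Python function (the name numeronym is just a label chosen for this example) shows the arithmetic:

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of inner letters + last."""
    if len(word) < 3:
        return word  # nothing between the first and last letters
    return f"{word[0]}{len(word) - 2}{word[-1]}"
```

Since "internationalization" has 20 letters and "localization" has 12, the function yields "i18n" and "l10n" respectively.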
See https://en.wikipedia.org/wiki/Internationalization_and_localization
for more information about i18n and l10n.
Social Networking Services
Perhaps surprisingly for such social endeavors, open source
projects typically make only limited use of what most people think of
as "social networking" services. But this seeming omission is really
a matter of definition: most of the infrastructure that open source
projects have been using for decades, since long before "social
networking" became a recognized term, is actually
social networking software even if it isn't called that. The reason
open source projects tend not to have much presence as
projects on, say, Facebook is just that the services Facebook
offers are not well-tuned to what open source projects need. On the
other hand, as you might expect, the infrastructure these projects
have been using and improving for many years is
quite well-tuned to their needs.
Most projects do use Twitter and similar microblog services,
because sending out short quips and announcements that can be easily
forwarded and replied to is a good way for a project to have
conversations with its community; see LibreOffice's "@AskLibreOffice"
tweet stream at https://twitter.com/AskLibreOffice for an example of this. Projects
also sometimes use services such as https://www.eventbrite.com/ and https://www.Meetup.com/ to arrange in-person
meetings of users and developers.
But beyond lightweight services such as those, most free
software projects do not maintain a large presence on mainstream
social media platforms (though individual developers sometimes do, of
course, and often discuss the project there). The reward the
project gets in exchange for that investment of time and attention
appears not to be high enough to be worth the effort.