<posted: 2019-09-17>
<updated: 2019-09-18>
As a programmer working with scientists, I am asked with some regularity, “What is Git, and why would I use it?” Far more frequently, however, I find myself asking others, “Why aren’t you using Git (or some other version-control system)?” It is my hope that this becomes a page I can send someone when the inevitable reply arrives: “What is Git?”
Teaching users how to use Git is beyond the scope of this post. It is my hope that this can serve as a good explanation of the features of and motivation for Git, which should provide the necessary context to make digesting other tutorials on Git easier and faster.
Since the beginning of Time (January 1st, 1970 00:00 GMT), programmers have had to deal with managing a lot of files, many of which would undergo numerous iterative changes. To make matters worse, many of these programmers would have the audacity to simultaneously work together on projects, making it necessary for them to track their respective changes and combine them together into the developing project. This is a lot of work, and very important to get right, so the first Version Control systems were developed by programmers, for programmers.
Since this article isn’t really about the history of VCS, or even about any VCS other than Git, I’m going to skip straight to what Git does and why we use it. In the broadest possible sense, Git is a sophisticated way of tracking changes to files. If you have ever used Track Changes in a Word document, then you’re familiar with the concept of receiving a batch of changes from a collaborator, and being able to inspect the document before and after the new changes are applied. If you haven’t… well, it does exactly what I just described.
Git takes this concept and runs with it, but for a whole directory of files instead of just a single one. Sets of changes are bundled together into a commit. A commit might have changes like “Change this sentence to that” and “Create a new file called this with that text”, etc. A commit can be large or small, but ideally it should be a somewhat cohesive set of changes - think “Adding a new simulation type” instead of “All April and May 2019 changes”.
Tracking changes in this way unlocks another very important feature of Git, virtually for free - it’s trivial to revisit old versions. It becomes easy to answer very common questions like “what changed since the last version?” and “Did this error exist 2 versions ago, or is it new?”
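As a rough sketch of what answering those questions looks like, here is a throwaway repository with two versions of one file. The file name, its contents, and the author identity are all made up for illustration, and the temporary directory is just scratch space.

```shell
# Build a throwaway repository with two commits to inspect.
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "growth_rate = 0.5" > params.txt
git add params.txt
git commit -q -m "Add initial parameters"

echo "growth_rate = 0.7" > params.txt
git commit -q -am "Increase growth rate"

# List every version, newest first.
git log --oneline

# What changed since the last version?
git diff HEAD~1 HEAD

# What did the file look like one version ago?
old_contents=$(git show HEAD~1:params.txt)
echo "$old_contents"
```

Here `HEAD` means "the current version" and `HEAD~1` means "one version before it".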
Using Git also makes collaborating with others relatively easy. You can integrate a commit from someone else just as easily as you integrate one of your own, and it comes with many tools for handling conflicts - i.e., determining what should happen when both you and a collaborator just changed the same file in mutually incompatible ways.
I can hear a number of you through the computer saying “Great, Git helps me track, merge, share, and roll back changes. But what should I actually use it for? If I’m working by myself, isn’t that just the same as a backup?”
I’ll try to tackle the natural questions in a useful order.
Broadly speaking, Git is ideal for projects which contain text files that are being modified or improved over time. Git can work on binary files (like images), but its strengths lie in tracking changes to text files. Viewing the history of edits of a raw text file (like MATLAB code, a paper draft written in LaTeX, a simple text file, etc) will give you a clear picture of how that text file has been changed. But changes to binary files like images, PDFs, or even Microsoft Word documents (though there are workarounds for this) are going to be inscrutable without jumping back to the old version and opening it up to take a look. You can still do this, of course; it’s just marginally less helpful than being able to see how it has changed.
Comparing Git to a backup is common, but it is not quite right - at least, it’s not complete. You could definitely use Git to track the history of a file, allowing you to recover old versions. But if the hard drive dies, both your current and old versions are gone unless you backed them up elsewhere. An incremental backup system, like Apple’s Time Machine, lets you recover old versions of files, but with far less clarity and control. And using Git with an incremental backup is indeed redundant. A backup is for recovery, so you don’t lose your files in the case of a disaster, whereas Git is for tracking and sharing changes. However, Git does make it very easy to use an external Git server/service, like GitHub, to create a proper backup of your work (security experts will quite reasonably caution you about exclusively relying on a 3rd party for backups, but that’s beyond the scope of this post).
Not so fast… The best way to use Git is to track individual projects, separately. A Git “project” is called a Repository (sometimes repo for short). Indeed, you could put all of your work in one folder, make that folder a Git repo, and call it a day, but this would fail to capitalize on some of the strengths of Git. Instead, if you made a single Git repository for a single MATLAB simulation you have, or a paper you’re working on, then all of the changes in the history are relevant to that project, and it’s significantly easier to Share, Examine old versions, and Fork (if you want to make some incompatible changes and take the project in a mutually-exclusive direction).
Git is the program (initially created by Linus Torvalds in 2005 when he was working on the Linux Kernel) that does all this work in tracking, handling, and sharing changes. GitHub is the website/service that (among other things) allows you to back up, share, and manage some Git repositories on the internet. You can have private Git repositories on GitHub that only you (or you and a few people you select) can access, but it should go without saying that you should be careful about uploading sensitive files to the internet. You can use Git without GitHub, but there is no GitHub without Git.
If you’re already comfortable using a command-line interface (that is, typing into a terminal or command prompt), then getting started is incredibly easy and there are many tutorials out there that will walk you through it. For everyone else, there are a lot of tools designed to make the process simpler. I tend to recommend GitHub Desktop to new users, as it makes some of the common features relatively easy to use. I’ll also list some useful learning resources at the end of this post.
A quick reference on some common terms used when discussing Git.
A repository is a single Git project. It tracks the history of all commits, and is completely decentralized. This means that every copy of the repository has a complete view of the history, not just a snapshot of the current state.
A commit is a single bundle of changes within a repository. A commit contains all of the changes, the author who created the commit, the date it was created, and usually a Title and Description of what the commit contains. The History of a repository is all of the Commits that brought it from just an empty directory to the state it’s in now. Making changes to files within a repository doesn’t automatically create a Commit - you do that deliberately when you’re ready by adding and committing the files you wish to be included, and providing the description.
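To make those pieces concrete, here is a small sketch that creates one commit and then prints what it stores. The file, the messages, and the author identity are placeholders I made up for the example.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "x = 1" > model.txt
git add model.txt
# The first -m is the title; the second -m is the longer description.
git commit -q -m "Add model file" -m "Starts the model with a single variable."

# Print the author, date, title, and description stored in the commit.
git log -1 --format='Author: %an <%ae>%nDate: %ad%nTitle: %s%nDescription: %b'

# Summarize the changes the commit bundles together.
git show --stat HEAD

commit_title=$(git log -1 --format=%s)
```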
A branch in Git is like a branch in the road of a repository’s history (indeed, the term wasn’t chosen arbitrarily). The main, primary branch in a repository is usually called the `master` branch. Its history is just a linear sequence of changes to create the current version. However, if you went back to an old version of the repository and made different changes, you could preserve that as a new `branch`, and freely switch between them. Frequently in shared programming projects, the `master` branch is where the current stable code lives that outsiders should use, and new features that aren’t yet complete are developed in other branches, and then merged into the `master` branch when they’re ready.
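A minimal sketch of creating a branch and switching back and forth might look like this. The branch name `new-feature`, the file, and the identity are hypothetical; I also rename the primary branch to `master` explicitly, since newer Git versions may pick a different default name.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "stable code" > sim.txt
git add sim.txt
git commit -q -m "Add stable simulation"
git branch -M master   # use the branch name from this article

# Start a new branch for unfinished work and switch to it.
git checkout -q -b new-feature
echo "experimental code" >> sim.txt
git commit -q -am "Start experimental feature"

# Switch back: the working files return to the stable version.
git checkout -q master
stable_contents=$(cat sim.txt)

# List both branches; the current one is marked with *.
git branch
```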
Merging is where the changes in two different branches are reconciled and one branch is updated to include them all. If I committed some changes that tweaked some of the math of a simulation, and you created a commit with some changes to make it faster and more efficient, then the process of me updating my repository with your changes would involve merging.
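Here is a sketch of that math-versus-speed scenario, played out with two branches in one scratch repository rather than two people's computers. The file names, values, and the branch name `faster` are invented for the example, and the two sets of changes touch different files so there is no conflict to resolve.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "rate = 0.5" > math.txt
echo "steps = 100" > speed.txt
git add math.txt speed.txt
git commit -q -m "Add simulation"
git branch -M master

# The collaborator's branch: make the simulation faster.
git checkout -q -b faster
echo "steps = 50" > speed.txt
git commit -q -am "Use fewer steps"

# My own work on master: tweak the math.
git checkout -q master
echo "rate = 0.6" > math.txt
git commit -q -am "Tweak growth rate"

# Merge the collaborator's branch into mine.
git merge -q --no-edit faster

cat math.txt speed.txt   # both sets of changes are now present
```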
Cloning is the process of downloading a repository, usually one hosted on a server (like GitHub). A cloned repository is generally “meant to” stay up-to-date with its source.
Forking is the process of spinning off a new repository that’s a copy of some point in the history of an existing repository (called the “upstream” repository). Frequently repositories are forked so that the Forker can have control over changes. (See Pull Request)
Pushing is the process of uploading your new commits to another copy of the repository, usually one hosted on a server (like GitHub).
Pulling is the process of downloading new commits from another copy of the repository, usually one hosted on a server (like GitHub).
Adding, also called staging, is how you choose what goes into a commit. Not every change that exists in a repository must be included in the next commit you create. Before creating a commit, each of the modified files that you’d like the commit to actually include is added to a list of the pending changes. If you changed 2 files and deleted a third, you could choose to “add” the changes, but not the deletion, to the next commit. This would work whether you just wanted to include the deletion as part of a separate commit, or even if you didn’t wish to include it at all.
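The two-changes-and-a-deletion case can be sketched like this. The files `a.txt`, `b.txt`, and `c.txt` are hypothetical stand-ins for real project files.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "one" > a.txt
echo "two" > b.txt
echo "three" > c.txt
git add a.txt b.txt c.txt
git commit -q -m "Add three files"

# Change two files and delete the third.
echo "one, edited" > a.txt
echo "two, edited" > b.txt
rm c.txt

# Add only the two edits; the deletion stays out of the next commit.
git add a.txt b.txt
git status --short
git commit -q -m "Edit a.txt and b.txt"

# The deletion is still pending after the commit.
pending=$(git status --short)
echo "$pending"
```

After the commit, the latest version in the history still contains `c.txt`; the deletion can go into a later commit or be discarded.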
Rebasing is a bit complicated to cover appropriately here, but essentially it is the process of updating commits as if they came later. If Commit A modified the first line of “TestFileA.txt”, and Commit B renamed the file to “TestFileB.txt”, then if you Rebase Commit A onto Commit B, Commit A would say that it modified the first line of “TestFileB.txt”.
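As a rough illustration, here is a simpler case than the rename example above: Commit A and Commit B each add a different file on different branches, and Commit A is then replayed on top of Commit B. The branch name `side` and the file names are made up for the sketch.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "base" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git branch -M master

# Commit A lives on a side branch.
git checkout -q -b side
echo "from A" > a.txt
git add a.txt
git commit -q -m "Commit A: add a.txt"

# Meanwhile, Commit B lands on master.
git checkout -q master
echo "from B" > b.txt
git add b.txt
git commit -q -m "Commit B: add b.txt"

# Replay Commit A as if it had been made after Commit B.
git checkout -q side
git rebase -q master

# Commit A now sits on top of Commit B in a single line of history.
git log --oneline
```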
A pull request is a GitHub-specific concept: a recommendation that specific changes be made to a repository. For example, if you identified a bug in a public repository you were using, you could fork the repository, create a commit that fixes the bug, and then open a pull request for the original repository with the commit and an explanation of the issue. If the owner accepts the pull request, then your changes would be merged. This is how public code and documents can be improved without devolving into narcissistic anarchy.
A special file that you can add into a Git repository, always with the name “.gitignore”. Unsurprisingly, this file tells Git what files to ignore. Git will act like any file or directory described in it doesn’t exist, whether described literally (`Examples/DataFile.txt`) or with a wildcard pattern (all csv files with `*.csv`). This is useful to avoid, for example, adding temporary autosave files like MATLAB’s “scriptName.m~” (with the pattern `*.m~`), or compiled output (sometimes `*/obj`).
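A small sketch of a `.gitignore` in action, using three of the patterns mentioned above. The files created here are empty placeholders whose names I chose to match those patterns.

```shell
cd "$(mktemp -d)"
git init -q

# Ignore one file literally, and two kinds of files by pattern.
printf '%s\n' 'Examples/DataFile.txt' '*.csv' '*.m~' > .gitignore

mkdir Examples
touch Examples/DataFile.txt results.csv script.m script.m~

# Only files that are not ignored show up as new, untracked files.
git status --short

# Ask Git which of these paths it is ignoring.
ignored=$(git check-ignore Examples/DataFile.txt results.csv script.m script.m~)
echo "$ignored"
```

`git check-ignore` prints only the paths that match an ignore rule, so `script.m` is left out of its output.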
This section is to explain how the process of using Git might look, but is not by any means a proper tutorial on its usage. For convenience and reference, proper Git commands will be highlighted.
To start with, an existing project directory is `init`ialized with Git, optionally a relevant `.gitignore` file is created, all the “modified” files (literally everything, in this case) are `add`ed, and then the changes are `commit`ted.
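In a terminal, this first step might look roughly like the following. The project files, the ignore pattern, and the identity are stand-ins for whatever your real project contains.

```shell
# Stand-in for an existing project directory with some files in it.
cd "$(mktemp -d)"
echo "x = 1" > simulation.txt
echo "scratch" > notes.tmp

# Turn the directory into a Git repository.
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

# Optional: tell Git which files to leave out.
echo "*.tmp" > .gitignore

# Add everything that is not ignored, then record the first commit.
git add .
git commit -q -m "Initial commit"

# List the files the commit is tracking.
git ls-files
```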
More changes are made to the project files. When complete, they are `add`ed to form the list of pending changes and those changes are `commit`ted.
At a later point, it becomes necessary to share the project with a collaborator, or perhaps it’s just necessary to do some work on a different computer. A new repository is created on GitHub. In the existing repository, the `remote` is configured, and then all of the commits so far are `push`ed.
On this secondary machine, the GitHub repository is `clone`d. Then some of the files are modified, `add`ed, and `commit`ted. Finally, this new commit is `push`ed to GitHub.
Now, the original project can `pull` the new commits from GitHub, adding the new commit to the original computer.
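The sharing steps above (configuring the remote, pushing, cloning, and pulling) can be sketched on a single machine. This is only an approximation: a local "bare" repository plays the role of the GitHub repository, and two directories play the roles of the two computers. All paths, file names, and identities are made up.

```shell
work=$(mktemp -d)

# Stand-in for the empty repository created on GitHub.
git init -q --bare "$work/hub.git"

# The original computer: an existing repository with one commit.
git init -q "$work/original"
cd "$work/original"
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
echo "v1" > data.txt
git add data.txt
git commit -q -m "First version"

# Configure the remote, then push the commits so far.
# HEAD means "the branch I currently have checked out".
git remote add origin "$work/hub.git"
git push -q -u origin HEAD

# The second computer: clone, modify, add, commit, and push.
git clone -q "$work/hub.git" "$work/second"
cd "$work/second"
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
echo "v2" > data.txt
git add data.txt
git commit -q -m "Second version"
git push -q

# Back on the original computer: pull the new commit.
cd "$work/original"
git pull -q

cat data.txt
```

With a real GitHub repository, the path `"$work/hub.git"` would be replaced by the address GitHub shows for that repository.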
If the changes were bad, the `log` could be inspected, a desired commit selected, and then one could `checkout` said commit, resetting the repository to this earlier state. One could even just `revert` the bad commits, which would automatically create new commits that undo the actions of the undesired commits (not merely purge them from the history).
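Both options can be sketched in a scratch repository with one good commit and one bad one. The file and commit messages are invented for the example.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "good" > result.txt
git add result.txt
git commit -q -m "Good version"
echo "bad" > result.txt
git commit -q -am "Bad change"
git branch -M master

# Inspect the log and pick the commit to return to.
git log --oneline
good_commit=$(git rev-parse HEAD~1)

# Option 1: check out that commit to see the project as it was then,
# and return to the latest version afterwards.
git checkout -q "$good_commit"
cat result.txt
git checkout -q master

# Option 2: revert the bad commit. A new commit that undoes it is
# added on top, and the bad commit stays in the history.
git revert --no-edit HEAD
cat result.txt
```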
When it’s time to take the plunge and learn how to use Git, here are some learning resources I can recommend:
Thanks for taking the time to read this. I will try to expand it as more useful details occur to me. If this was useful to you, or you feel I missed some important details, please let me know.
The contents of this blog are open source and pull requests are welcome. If anything I said is wrong, or you see some way that this could be clearer, send me a pull request or shoot me an email (Trevor.Stavropoulos@gmail.com) and I’ll integrate your changes and give you credit.
Special Thanks To: