Search This Blog

Tech Book Face Off: Database Design for Mere Mortals Vs. Seven Databases in Seven Weeks

In my quest to fill in some of the holes in my software development knowledge, I picked up a couple books on databases to read through. I somehow made it this far without studying anything in depth about databases, and up until now, when I needed to interact with a database I searched for what I needed at the moment and muddled my way through.

When I want to learn about a topic, I like to go through at least two books to get different perspectives and hopefully get a broader and more thorough understanding of the topic. This time I was not disappointed, and I picked two dramatically different books that worked quite well together to cover the vast subject of databases. Database Design for Mere Mortals is a book about how to design a database in a simple, straightforward way. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement is exactly what it sounds like, a book that runs through the features, advantages, and disadvantages of seven different modern databases to give you an overview of what's currently available and what these databases are capable of. Let's take a closer look at what each of these books has to offer.

Database Design for Mere Mortals front coverVS.Seven Databases in Seven Weeks front cover

Database Design for Mere Mortals


Databases are the foundation of many types of software applications, and I wanted to get a working understanding of how to design a database well. Michael Hernandez does an excellent job in this book laying out a process for designing a database with good data integrity and consistency, two important goals when designing a database. His writing style is direct, clear, and pleasant, and I found this book surprisingly easy to read considering it's covering a topic that threatens to be utterly dry and boring.

The book is split into three parts: an overview of relational database design, the details of the design process, and some other database design issues. The first part covers the history of database models, what a relational database is, the design objectives for a database design process, and the terminology of relational databases. The relational database that we know and love was conceived all the way back in 1970 in more or less it's current form with tables, fields, and relationships. Like with so many of the computer innovations of the 60s and 70s, we are standing on the shoulders of giants today with relational databases.

The second part of the book contains the majority of the content and lays out the details of the database design process. The process starts with defining the goals and objectives of the database and analyzing the current database that's in use, if there is one. This step, as well as most of the other steps in the process, involves numerous interviews with various stakeholders of the database, and example interviews are spread throughout this part of the book. The process continues with establishing the table structure of the database, defining the different types of keys for the tables, writing field specifications, designing table relationships, establishing business rules, defining views to represent database queries, and reviewing data integrity.

The third part of the book wraps up with some bad design practices to watch out for and advice for when you might think about bending or breaking the rules of good database design. Throughout the book, Hernandez focuses on the relational database and doesn't even bring up other modern database models, like the models covered in Seven Databases in Seven Weeks, but the process he lays out could be adapted to other types of databases fairly easily.

Early in the book Hernandez takes a decidedly database-centric view of application design by assuming the database is the starting point of an application, and the data structure should be established first with the rest of the application being built up around it. It's worth noting that this is but one way to design an application, and it normally results in an application with a data-centric feel to it. Another way to design an application is to start with the user interface (UI) and user experience (UX) and design the database to work with the application structure that results. Such an application will have a much more user-centric look and feel. Both approaches have their own benefits and issues, and of course a real project will likely be a hybrid of these two extremes because the design of the database and user interface is not going to be a serial process.

As for the database design process itself, I was surprised at how similar it is to good software design. Much of good database design boils down to eliminating duplication in the database, choosing descriptive names for everything in the database, and making the database flexible enough to change easily. Good software design will follow these same principles. Most of the book felt very familiar except that the terminology referred to databases instead of programming languages.

The reason that duplication in a database is a bad thing is that if the same field is replicated in multiple tables, then the data for those fields needs to be constantly kept in sync. It becomes much more likely that the data will be updated in one place and not the other, causing data integrity problems. The process of eliminating duplication is referred to as normalizing the database, but using Hernandez' design methodology, duplication is never allowed into the database, so normalization shouldn't be necessary. The equivalent principle in software design is the DRY principle, and it is usually achieved through refactoring.

Naming is one of the hardest problems in software design, and it seems from this book, the same is true of database design. At least half of the book is about how to name various database structures, and many times naming a table or field will provide clues as to how it should be changed to make the design better. One place where naming can reveal design issues is when a table has a multi-part field. A multi-part field is a field with values that have more than one distinct subject. The canonical example is an address field that contains full addresses as string values. An address consists of a street name, city, state, and zip code. A better design would have separate fields for each of these parts of the address.

While the advice on naming was generally good, and I learned quite a lot about subtle naming issues in database design, some of the advice struck me as odd. Hernandez recommends using an abbreviated form of the table name as a prefix to disambiguate field names that would otherwise be the same across tables, like when there are address fields for employees, customers, and suppliers in their respective tables. He advises having names like "EmpAddress," "CustAddress," and "SuppAddress." Later he abbreviates things like "Classes" to "Cls" and "Student" to "Std," which makes even less sense to me. I would think this practice would add confusion, and adds mental effort by having to do the translation in your head every time you use these abbreviated fields. I've come to appreciate the advantages of not abbreviating names in programming, and the cost of a little extra typing is not terribly significant. I would prefer the unabbreviated names for clarity.

Another issue I had with this design process was how much it seemed like a Big Up-Front Design (BUFD) process. The entire database design is decided in the beginning through meetings and interviews, and Hernandez recommends everything be documented in paper folders. And I mean everything:
Once you’ve identified which web pages you need to review, take a screenshot of each page. Copy the screenshots into a word processing document, print the document, and then store the document in a folder for later use.
It seemed quite rigid and optimistic to me with how much requirements can change during a development project. Maybe there's something about database design that requires this type of BUFD, but I can see why programmers can get frustrated with the database team if this is how they have to operate. I wonder if there's a way to make the design process more flexible and (dare I say) agile without sacrificing data integrity and consistency in the database design.
Despite these misgivings, Database Design for Mere Mortals was an excellent introduction to database design overall. Hernandez' approach was simple, straightforward, and methodical, and he did a great job of laying out a process that should be accessible to most people. In his own words, database design isn't hard:
I’ve always believed that you shouldn’t have to be a rocket scientist in order to design a database properly. It should be a relatively straightforward task that can be performed by anyone possessing a good amount of common sense. As long as you follow a good database design method, you should be able to design a sound and reliable database structure.
I always thought that to be true, from what little I knew about databases, and it's good to have the confirmation from an expert. I definitely recommend this book to anyone looking for a good way to design a database effectively. Maybe you can even find a way to make it more adaptable to change.

Seven Databases in Seven Weeks


I've been wanting to read one of these Seven X in Seven Weeks books for a while now, and this is the first one I was able to get to. It was excellent. I'm amazed by how much Eric Redmond and Jim Wilson were able to fit into 350 pages, and it's entirely coherent while being an immensely enjoyable read. They do actually cover seven databases in reasonable detail in this book, dedicating roughly 45 pages to each database.

It's an interesting exercise to think about. If you had to cover a database in 45 pages, what would you include? What would you leave out? I think the authors struck the perfect balance. They decided not to cover installation of any of the databases, nor did they explain much about the programming languages that they used to interface with the databases (and they used an impressive number of languages: Ruby, Java, JavaScript, Groovy, and Gremlin). You were either expected to know it or be able to figure it out on your own. That was fine with me because they covered a ton of stuff in the short amount of space available to each database, and their explanations were crisp, brisk, and understandable.

The basic format of the book is one chapter per database, and each chapter is split into three sections representing three days of study. The first day covers the basic API and main features of the database. The second day goes through the more advanced features of the database that make it unique and interesting. The third day does something more involved using a large data set to populate the database and showing what kinds of structures and queries are best suited to that particular database. The end of each day includes some homework exercises to guide the reader in learning more about how each database works.

I thought this format worked extremely well, although the "seven weeks" in the title makes it seem like it would take longer to get through the book than it actually does. Since each chapter consists of three days, you could finish everything in three weeks if you did something everyday. Each day's work is also short enough that it is conceivable that you could squeeze seven databases into seven days instead of weeks, but you wouldn't be able to do anything else other than eat, sleep, and study to do it, so probably not realistic.

At any rate it's a great way to learn something significant about seven databases in a short amount of time. You'll learn their strengths and weaknesses and how to interface with them through code as well as the provided command line or HTTP interface. If you were looking for a new database to use on a project, this book would give you all the right information to figure out which database would work best as well as get you started on using it for your project.

I've come this far without even mentioning what the seven databases are, so I'll stop raving about the book and briefly describe each database that the authors cover.

PostgreSQL
PostgreSQL is a traditional relational database that has proven itself many times over on large software projects. As a relational database, the basic structure of the data it stores is in the form of tables that are made up of fields that describe characteristics of the data and rows that hold the values of those characteristics for specific entries in a table. Tables can be linked together with uniquely-identified primary keys, and queries can filter and sort the table data and use keys to find linked data in very flexible ways.

PostgreSQL has the advantages of being fast and having great driver support in most programming languages. It's a mature, reliable database with decades of development behind it, and being a relational database, most developers will be familiar with how it stores data and how to query it. As for weaknesses, it does suffer from scalability issues if you need to partition it onto multiple servers. Most relational databases don't handle partitioning nearly as easily as other types of databases. Also, if your data doesn't fit well into a rigid schema structure or mostly contains large blobs of data, PostgreSQL probably isn't the right database for your data.

Riak
Riak is a distributed key-value store, which means it's built from the ground up to run on multiple servers, and it stores and retrieves data using unique keys for each piece of data. The data can be anything from plain text to video clips, and the keys can be automatically generated or manually assigned. Data can be further divided into buckets to separate key-value pairs in a logical way. The interface to Riak is HTTP requests, and all the normal CRUD (Create, Read, Update, and Delete) operations are done through a slick REST interface using URLs so it's easy to interact with Riak through any programming language's HTTP library.

Because Riak is distributed, it cannot be fully consistent, available, and partition tolerant at the same time. This type of trade-off pops up in all kinds of places. In project management you can't optimize fully for time, cost, and features. In programming you can't optimize code fully for performance, readability, and development time. You have to pick two of the three options. Riak steps around this problem by allowing you to pick two options for each request to the server. One write to the database could be consistent and available, and the next write could be available and partition tolerant instead. That's cool.

HBase
The authors' description introduction to HBase is pretty good: 
Apache HBase is made for big jobs, like a nail gun. You would never use HBase to catalog your corporate sales list, just like you’d never use a nail gun to build a dollhouse. If your data is not measured by many gigabytes, you probably need a smaller tool.
It's a column-oriented database similar to Google's BigTable database, and it's designed to be fault tolerant and highly scalable. The structure of the database is a set of tables with each table containing rows that are kept sorted by the row key. The rows are divided into column families, and each column family in a row can have it's own set of columns. The same column family in two different rows can contain different columns. Every change to a row will create a new version of the row, so the database provides revision control for free.

Being a highly scalable database, HBase's major weakness is that it's not for small data sets. Like the quote says, if you're not dealing in huge amounts of data, HBase is going to be overkill.

MongoDB
MongoDB is a document database. Instead of storing data in tables, columns, and rows or key-value pairs, it stores data in documents. A document is a JSON string that contains an _id field and any number of other fields with a name and a corresponding value. That value can be a string, a number, an array, or even another hash, which is what the base document is. There is no defined schema, so each document can be completely different than the others, although in practice such an ad hoc database would be unusable. The application is responsible for maintaining the data structure that it requires.

MongoDB is another scalable database designed for storing massive amounts of data and running on large clusters of machines. It has more flexible querying capabilities than Riak or HBase, but you can easily shoot yourself in the foot with its lack of schemas. The application's data model has to be well defined, or else the database can quickly become a humongous mess.

CouchDB
CouchDB is a document database like MongoDB, and it has a REST interface like Riak. It was designed to be simple and flexible, and it can run on almost anything—from large data centers to your smart phone. It never deletes data stored to it. When you update a document, it creates a new version of it so you get revision control for free, and this mechanism is also the way CouchDB ensures data consistency in a distributed system.

Neo4j
Neo4j is an entirely different type of database, as the authors describe:
Neo4j is a new type of NoSQL datastore called a graph database. As the name implies, it stores data as a graph (in the mathematical sense). It’s known for being “whiteboard friendly,” meaning if you can draw a design as boxes and lines on a whiteboard, you can store it in Neo4j. Neo4j focuses more on the relationships between values than on the commonalities among sets of values (such as collections of documents or tables of rows). In this way, it can store highly variable data in a natural and straightforward way.
It would be perfect for storing a family tree or, as explored in the book, actor connections in movies so that you could query for actors six degrees from Kevin Bacon. It's an interesting departure from most other databases, and it's great for researching certain kinds of data problems.

Redis
Redis is another key-value store like Riak, but its defining characteristic is its speed. Redis is fast, and if it's configured to be a purely in-memory store, it's blazing fast. It's more like a cache than a database when set up this way, and normally it is backed by one or more other databases. Redis is used to speed up access to other databases more so than storing data itself. It also has a very fully-featured command line interface with over 120 commands that make it easy to use and integrate into a larger system.


These summaries give you an inkling of what's covered in Seven Databases in Seven Weeks, but I can't begin to do any of these databases justice in this short space. The authors even claim that they only scratch the surface, although I would say they do a phenomenal job of it. It was quite an entertaining read, and I finished it with a much better understanding of the wide variety of databases out there and what their strengths and weaknesses are. The appendix even includes a great comparison table of the different features and capabilities of the seven databases, a great reference to come back to later. I highly recommend this book for anyone who needs to find a new database to solve a gnarly problem or who is curious about the different kinds of databases available beyond the ubiquitous relational database.

A Crash Course in Databases


I set out to learn some useful things about databases, and with these two books, I definitely accomplished that. Database Design for Mere Mortals gave me a good understanding of the issues involved in designing for a relational database and how to solve common database problems. The design process should be fairly easily adaptable to other types of databases as well, and it gives a clear picture of where things can go wrong that will come in handy when you're designing for a schema-less database. Seven Databases in Seven Weeks exposed me to a wide variety of databases in a great hands-on way that was incredibly engaging. I'm looking forward to reading more of the Seven X in Seven Weeks series, and I hope they're all as good as this one.

Between the two books, I learned a ton of stuff about databases, but not everything, of course. There wasn't much coverage of relational algebra or the implementation of some of the fundamental features of databases, like indexing or storage management. I'll have to find a different book for those topics, but these two books were excellent in what they covered. They complement each other well with almost no overlap in material, and together they're a great crash course in databases.

Avoid Analysis Paralysis by Experimenting

Have you ever been stuck on a problem, not because you didn't have any ideas for how to solve it, but because you had too many? It's a common situation to fall into in software engineering because any given problem will have a multitude of solutions, and each of them boil down to writing and changing the right code, a process that can be quite quick and has a tight feedback loop. It's a much faster process than designing complex hardware systems, like integrated circuits or automobiles, and the turn-around time on an idea can often be measured in minutes to days instead of months to years. Often the programmer is faced with numerous potentially promising solutions at once. With so many choices and fast implementation times, a good way to get out of the analysis paralysis rut is to roll up your sleeves and start experimenting.

Software is flexible and malleable in a way that hardware is not. Software is more like writing or movies or music in that it can be bent, stretched, molded, and rearranged on a whim, so trying out ideas should be the default method of attacking small- to medium-sized problems. Obviously, attacking large problems in software by constantly tearing up and replacing the basic architecture of the code base will inevitably lead to disaster. Those decisions require more forethought and design, but for the day-to-day problems we all have to deal with, experimentation is a great way to discover what works best and keep moving forward.

If you find yourself in a discussion or meeting, debating the best course of action on a development problem, and you start to realize that the combined time of everyone in the room is approaching the time it would take to implement at least two of the options, it's probably better to start spinning up some experiments to prove out a winner instead of continuing to hash out arguments that have no clear answer.

One reason why we may resist experimenting in favor of discussions is that we have a mental block for changing working code in an unfinished project. Even though version control makes it easy to try out ideas and roll them back, it can still become harder and harder to make changes to a project as it matures. The project builds up inertia, and starts to resist change. Scott Berkun has a nice article about how ideas can become too precious, and they start to take on a life of their own:
Being precious means you’re behaving as if the draft, the sketch, the idea you’re working on is the most important thing in the history of the universe. It means you’ve lost perspective and can’t see the work objectively anymore. When you treat a work in progress too preciously, you trade your talents for fears. You become conservative, suppressing the courage required to make the tough choices that will resolve the work's problems and let you finish.
In the case of a software project the precious ideas are the accumulated work on the project that exists in the current state of the code base. To make progress, we have to push against the inertia of the code base and force improvements through experimentation. To push things in a new direction, branch the code and start experimenting with options until one emerges as a clear winner. If there is no clear winner, then you can still pick something from the implemented options and run with it, and you haven't wasted time debating options that didn't have much differentiation anyway.

When you're experimenting, don't be afraid to throw stuff away. It can feel like you're writing a lot of code that will never get used, but the code is not the important part. You're not throwing away work, you're gaining knowledge. The superfluous code is just a byproduct of gaining that knowledge. Over time the knowledge will accumulate and mature into wisdom. You'll be able to make better decisions, sidestep more pitfalls, and create viable solutions faster with the buildup of knowledge from regularly experimenting.

Experimentation turns out to be a useful way to make progress in many fields, some not so obvious. Take mathematics for example. My early impression of advanced mathematics was along the lines of mathematicians standing in front of blackboards, pondering proofs until they had an epiphany, and then writing down the proof in a burst of genius. The reality is quite different. Mathematics does involve plenty of thinking, of course, but it also involves plenty of experimenting with ideas. Often a hard problem must be broken up into pieces or changed into an easier problem to solve first. Lemmas, propositions, and minor theorems need to be built up to help in proving a major theorem, and a lot of the work of mathematics is building these tools to create the final product. If experimentation can work for such an abstract field as mathematics, it can certainly work for software development.

To get better at experimenting, it's important to get comfortable with spinning up an experiment quickly. The faster you can get a programming experiment started, the less friction there will be when you need to run an experiment to make a decision. I'm reminded of when I used to throw pottery (not literally picking up pots and throwing them, but making ceramic pots on a potter's wheel). When throwing pots, the first thing you have to do is center the clay on the wheel so that when the wheel spins at high speeds, the clay doesn't wobble at all. If the clay isn't centered properly, the pot won't stay symmetrical and may very well tear itself apart before you finish it. Much of learning to throw involves learning how to center, and once you can do it well and do it quickly, you can make more and better pots. You can run a lot more throwing experiments with your wheel time, and you more quickly discover what works and doesn't for making beautiful pottery.

The analog in programming is to get good at writing the necessary startup code to get to a basic working program as fast as possible. If you do web development, practice bringing up a new website and connecting it to a database. If you do embedded development, practice starting a new embedded application with tasks, interrupts, and communication interfaces that you use regularly. If you do mobile development, practice creating a new app and hooking up a basic GUI. Normally, you only do these things once at the beginning of a project, so the process isn't always fresh in your mind. If you practice it, then you can do it quickly and run more experiments with new code without having to tear up an existing architecture to do it.

Finally, experimenting with problems in code and trying out solutions will uncover issues that you would have never thought of otherwise. Sometimes a solution will fit right into the existing code base, much better than you thought it would, and can even make the code it interacts with simpler and more elegant. Other times it will require major transformations of other algorithms, data structures, and interfaces that you may not have thought through when the solution was an image in your mind's eye. Experimentation provides useful feedback on those things that you may have missed, and when you experiment with real code, you'll often find that you didn't need the complex, high-performance solution that you thought you did.

The ability to experiment is a useful tool to have in your programmer's toolbox. The results of an experiment are much more convincing than vague arguments when deciding among a set of options. Don't worry about writing code that may be thrown away because the useful work is the knowledge that's gained, and that knowledge can be reused to solve similar problems in the future. Instead, worry about getting faster at running experiments by practicing how to bring up new applications quickly in your environment. Then you can get the definitive answers you need to make progress.

Exploring the Tension in Software Abstractions

Ever since we started making tools as a species, we've been dealing with abstractions. We've been adding layer upon layer of abstraction to every form of work imaginable since those first hand-made chisels and knives long ago. With every layer of abstraction that we add, we can do the same amount of work with less effort or more work with the same amount of effort. Those layers of abstraction don't come for free, though. They add complexity. Especially within software engineering, where we have fine control over abstractions and can add many layers of abstraction even within one program, a tension exists between the abstractions we create and the complexity that results. We have to make intelligent choices with those abstractions while being aware of the trade-offs.

Movement and Computation


To get an idea of how we use abstractions to increase the efficiency of doing work, we can look at something physical that we've abstracted over time, like movement. How we get from point A to point B has had a huge impact on the development of civilization, and we've been abstracting how we get around for ages. We started out on our own two feet, but once we figured out how to train other animals to let us ride them, we could move faster with less effort and carry more stuff with us. That abstraction didn't come for free, though. We had to invest time training horses, donkeys, and camels, caring for our animals to keep them alive and healthy, and making equipment like saddles, bridles, and stables to improve the animals' usefulness. The additional work we had to do to support the abstraction of using animals for movement was worth the effort, and we could move around much better and get much more done than what we gave up in time spent caring for horses.

Another abstraction we layered on top of horse riding was attachments: buggies, wagons, and carriages. We could attach essentially a platform on wheels (with many variations of platform) to one or more horses so that we could carry much more stuff for much longer distances in more relative comfort than we could on horseback alone. This abstraction added more work to build, maintain, and repair the wagons, but it was still a big win in efficiency.

From here we could look at all kinds of other abstractions like bicycles, trains, airplanes, and rockets. Every abstraction of movement has its own advantages, trade-offs, and applications, but the one we'll look at is the automobile. It has replaced the horse with horsepower in the form of an engine (or most recently an electric motor) and combined that with the wagon for transporting people and stuff. We have spent massive amounts of time and energy designing, manufacturing, and supporting this abstraction, and we've made great gains in efficiency and convenience because of it.

Anyone who owns a car knows that cars need to be maintained. Periodically, the oil and filters need to be changed, the brakes need to be replaced, and various components need to be inspected for wear. Most of the time you don't need to know much about how the car works. You just need to know how to drive it to get around, but every once in a while the abstraction leaks. The car breaks down, and if you don't know enough about the abstraction to go under the hood and fix it, you might have to hoof it without the actual abstraction of a horse. Although, you likely have a modern abstraction of communication to help you line up a tow truck, a mechanic, and a ride to where you need to go.

Now compare the abstractions of movement we've developed with abstractions of computation. They have a lot in common. Computation has gone through its own series of abstractions, from teams of human "computers" that did calculations by hand to huge computing machines that read punch cards and used vacuum tubes for logic to the modern day computer with billions of transistors, terabytes of storage, and teraFLOPS of performance (at least). A modern software program is also built on many layers of abstraction, from a high-level programming language expression at the top to the silicon bandgap at the bottom.

Like transportation abstractions, our computation abstractions require huge amounts of design, manufacturing, and support effort. We have to maintain and repair our computers, and all of the abstractions that make them up can leak from time to time. They also allow us to do huge amounts of work with orders of magnitude less effort than we could do without them. Computers enable us to do all kinds of things that would be impossible without them, like measure the creation of previously undiscovered particles.

Layers of abstraction exist within a software program as well. Every class and function provides an abstraction of the computation that's done within it, and there are some trade-offs to consider when deciding when and where to add these abstractions. Software is quite flexible, and it's relatively easy to add or remove layers of abstraction to hide or expose more of a computation in a given function. That ability to finely layer abstractions for little cost changes the analysis a bit from the big abstractions of automobiles or microprocessors. In software design we are constantly doing small cost-benefit analyses, and that adds another dimension to the choices we make when writing code.

Queues and Logs


The trade-offs involved when choosing where to add abstractions to a program create a tension in the design of the program. There is a constant pull at every level of the design between hiding implementation details so that they can be ignored and showing those details so that they can be reasoned about in the context of where they are used. To explore this tension, let's look at an example.

Let's assume we're designing a logging feature for an embedded system. The embedded system could be anything, but let's say it's a robot. We want to keep track of vital information about the state of the robot while it's running, and be able to read back that state if something goes wrong to get clues about where the program messed up.

We don't have infinite space for this log, especially because it's an embedded system, so a circular queue would be a good data structure to use. We can add new log entries to the front of the queue and read them from the tail of the queue. Once the queue gets full, it will start overwriting the oldest entries at the tail of the queue with the newest entries that are being written to the head. Here is a basic implementation of a queue written in C:
/*
 * Queue.h
 */

#ifndef SOURCES_QUEUE_H_
#define SOURCES_QUEUE_H_

#include <stdint.h>

typedef uint8_t * QUEUE_ELEM_PTR;

typedef struct QUEUE {
  QUEUE_ELEM_PTR base;
  QUEUE_ELEM_PTR head;
  QUEUE_ELEM_PTR tail;
  uint32_t elem_size;
  uint32_t elem_count;
  uint32_t q_size;
} QUEUE, *QUEUE_PTR;

void Queue_init(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_q_base,
                uint32_t elem_size, uint32_t q_size);
void Queue_enqueue(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_new_elem);
int32_t Queue_dequeue(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_elem);

#endif /* SOURCES_QUEUE_H_ */
/*
 * Queue.c
 */

#include <string.h>
#include "queue.h"

void Queue_advance_ptr(QUEUE_PTR p_q, QUEUE_ELEM_PTR *p) {
  (*p) += p_q->elem_size;
  if (*p >= p_q->base + p_q->q_size) {
    (*p) = p_q->base;
  }
}

void Queue_init(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_q_base, 
                uint32_t elem_size, uint32_t q_size) {
  p_q->base = p_q_base;
  p_q->head = p_q_base;
  p_q->tail = p_q_base;

  p_q->elem_size = elem_size;
  p_q->elem_count = 0;
  p_q->q_size = q_size;
}

void Queue_enqueue(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_new_elem) {
  QUEUE_ELEM_PTR p_write = p_q->head;

  Queue_advance_ptr(p_q, &p_q->head);

  // Advance the tail if the head has caught it.
  if (p_q->head == p_q->tail) {
    Queue_advance_ptr(p_q, &p_q->tail);
    p_q->elem_count--;
  }

  // Now add event in the old location.
  memcpy(p_write, p_new_elem, p_q->elem_size);

  p_q->elem_count++;
}

int32_t Queue_dequeue(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_elem) {
  if (p_q->tail == p_q->head) {
    return -1;
  } else {
    memcpy(p_elem, p_q->tail, p_q->elem_size);
    Queue_advance_ptr(p_q, &p_q->tail);
    p_q->elem_count--;
    return p_q->elem_count;
  }
}
This queue implementation is an abstraction, and we can already see some tension between it and the code that would use it. For instance, the QUEUE structure contains an element size and a queue size that are set by the caller, and the Queue_init routine doesn't check to make sure that the queue size is a multiple of the element size. It's the caller's responsibility. If this were a data structure that would be part of a library, maybe it would be appropriate to do the check for the caller. Then we would have to define an error code and make sure the caller checked the error code if there was a risk of misaligned elements. Since this is for an embedded project where all of the source code would be available to the developer and the number of developers that will be using the code is relatively small, it might be okay to forgo the check. There's no clear answer, and that creates tension.

Some other trade-offs that were made were that the queue doesn't allocate any memory—the caller is responsible for that, too—and the queue enforces that the head and tail only point to the same element if there are no elements in the queue. That means the queue can never be truly full. It will always advance the tail and remove an element if the head catches the tail.

With this queue, we could fairly easily create a log structure that uses the queue as an abstraction and offloads some of the complexity of logging so that we don't have to think about all of the details at once. It's clear that putting the code for handling the queue in one place is better than having it spread throughout the logging code, and it allows us to leverage the abstraction to get more work done, much like using a car to run errands.

The queue's functions also adhere to the Single Responsibility Principle (SRP), with each function doing one thing. The one possible exception is the Queue_dequeue function, which returns the number of elements left in the queue. It's a slight deviation from SRP, but it's very convenient when looping through the queue to know when the last element was just dequeued. Again, we see that there's tension about what the best design choice is.

What if there's a crash?


It's all well and good that we have logging to a circular buffer for our robot, but what if the code crashes and reboots? What if the robot starts running amok and we have to hit the shutdown switch? The queue only resides in memory, so in most failure cases, it's likely to be lost before we could read it. To make the log more permanent, we want to write it to non-volatile memory. Most embedded processors have internal flash, so we could use that flash or attach an external flash memory to store the log. Two problems still remain with this solution, our queue implementation doesn't know anything about writing and erasing flash or about initializing memory that may have some valid elements in it already.

Let's start with the problem of initializing the head and tail pointers for memory that might already have values in it. Erased flash memory always has a value of 0xFF (hexadecimal notation), so we can use that value to figure out where the head and the tail of the queue are. Starting from the base of the queue, (or anywhere within the queue, really) the tail will be the first non-erased element after an erased element, and the head will be the first erased element after a non-erased element. We can add this code to the Queue_init routine like so:
bool Queue_element_is_blank(QUEUE_PTR p_q, QUEUE_ELEM_PTR p) {
  QUEUE_ELEM_PTR p_end = p + p_q->elem_size;
  for (; *p == 0xff && p < p_end; p++) ;
  return (p == p_end);
}

void Queue_init(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_q_base,
                    uint32_t elem_size, uint32_t q_size) {
  p_q->base = p_q_base;
  p_q->head = p_q_base;
  p_q->tail = p_q_base;

  p_q->elem_size = elem_size;
  p_q->elem_count = 0;
  p_q->q_size = q_size;

  /*
   * head = 1st blank entry after a non-blank entry
   * tail = 1st non-blank entry after a blank entry
   */
  bool was_blank = true;
  for(QUEUE_ELEM_PTR p_elem = p_q->base;
      p_elem < p_q->base + p_q->q_size;
      p_elem += p_q->elem_size) {
    if (!Queue_element_is_blank(p_elem)) {
      if( was_blank ) {
        p_q->tail = p_elem;
      }

      p_q->elem_count++;

      was_blank = false;
    } else {
      if (!was_blank) {
        p_q->head = p_elem;
      }
      was_blank = true;
    }
  }
}
Adding this code to Queue_init shows another form of tension in writing functions. The two parts of the function, assigning values to the structure and finding the head and tail, could have been pulled out into their own functions and called from Queue_init. That would make Queue_init simpler because it would only be two function calls, and it would be much easier to understand at a glance. It may also be harder to introduce bugs because each step of the initialization process is isolated in its own function. Plus, a developer that only wanted to check how the structure is initialized or to look at the head-and-tail finding algorithm could focus on that piece of the code because it's in its own function.

On the other hand, keeping all of the initialization steps together at the same level of abstraction has its own advantages. It may be more complex, but not substantially so. The whole function is only about 30 lines of code. Having all of the code in one place makes it easier to see the entire behavior of the code at once. We don't have to jump back and forth between three different functions to see everything that it's doing. Seeing all of the details at once, as long as it's not too many, can make it easier to change the function, if needed, and see bugs that might be hidden between layers of abstraction. It's a difficult design choice, and every case will have its own conditions that tilt the balance one way or the other. In this case, it's probably best to keep all of the details together.

For the case of writing and erasing the flash in the Queue_enqueue function, things are a little different. The queue doesn't really need to know how to write and erase the flash, so those operations can be separate functions that are called when needed. However, the queue does need to know a little about the structure of the flash because normally flash memory locations can only be erased a block at a time, with a block being a range of addresses. To keep track of the tail, the queue will have to know the block size and advance the tail by that much whenever it erases a block. The new Queue_enqueue code looks like this:
int32_t Queue_enqueue(QUEUE_PTR p_q, QUEUE_ELEM_PTR p_new_elem) {
  QUEUE_ELEM_PTR p_write = p_q->head;

  Queue_advance_ptr(p_q, &p_q->head);

  // check if we need to erase things
  QUEUE_ELEM_PTR next_block = (p_q->head + FLASH_BLOCK_SIZE);
  bool tail_in_block = p_q->head <= p_q->tail && p_q->tail < next_block;
  bool erase = !Queue_element_is_blank(p_q, p_q->head);

  if (erase) {
    Flash_erase((uint32_t)p_q->head);
    if( tail_in_block ) {
      //need to move the tail to the next block
      p_q->tail = next_block - p_q->elem_size;
      Queue_advance_ptr(p_q, &p_q->tail);
    }

    uint32_t events_per_block = FLASH_BLOCK_SIZE/p_q->elem_size;
    p_q->elem_count -= events_per_block;
  }

  // now add element in the old location
  uint32_t ret = Flash_Write((uint32_t)p_write, p_q->elem_size, p_new_elem);

  if (ret) {
    p_q->head = p_write;
  } else {
    p_q->elem_count++;
  }

  return ret;
}
The function now returns an error code if the flash could not be written for some reason, and the FLASH_BLOCK_SIZE constant would be defined in the code for the flash. It would be nice if the queue didn't need to know about the flash because that's a different level of abstraction, and it's better to keep the entire function at the same level of abstraction. However, the head and tail management is bound up with the flash structure now, and it would probably add more complications and be more confusing if we tried to separate them. The other alternative, moving some of the head and tail management into the Flash_erase code, is even worse because the flash will most likely be used to store other things as well and mixing code for the queue into it would pollute the flash code.

In the end the compromise taken here isn't too bad because the function is still relatively short and understandable. Because it's an embedded system, not a general library, the implementation is already tied more closely with the hardware it's running on, and these kinds of compromises are made for the sake of code size and efficiency. A log in a different type of system would warrant a different choice, but then again, it may have a hard drive instead of flash, in which case the implementation would be totally different. Code should be designed for the context in which it is used.

This code is now a good starting point for implementing a logging function for our robot. The abstractions strike a good balance, and the benefits of using the queue with flash support likely outweigh the cost of understanding and maintaining it. It may not be as big of a jump as moving from a horse and buggy to a car, but programming is full of abstraction opportunities, and every well-designed abstraction helps get a little more work done for a little less effort.

Not Too Tight, Not Too Loose


We deal with abstractions all the time, and building abstractions on top of abstractions is how we make progress with technology. In software development it is especially apparent that abstractions are everywhere. The layers of abstraction within a program are thin, and they have a definite tension at the boundaries. Making the abstractions tighter can hide unnecessary details and improve code decoupling while making them looser can expose more interactions and improve code awareness. Good design balances the tension between those two concerns.

College is Much More Than a Means to a Higher Salary

Most entries into the debate about whether or not it's worth it to go to college focus on the financials. How much debt will you incur? Will you land a job that allows you to easily pay it off? Which degrees lead to higher salaries? As if college should be reduced to an accounting exercise.

To complicate matters, the field of software engineering has a history of producing billionaire college drop-outs: Bill Gates, Steve Jobs, and Mark Zuckerberg, and graduate school dropouts Larry Page and Sergey Brin. Of course, all of these billionaires started college, and besides, the vast majority of college drop-outs never achieve even millionaire status (survivorship bias). People also have the belief that it is entirely possible to be a self-taught software developer, and on the face of it, they're right, it is possible. Especially with today's online resources from MOOCs (Massive Open Online Courses) such as edX and Coursera to coding sites such as Codecademy and Treehouse, not to mention all of the blogs, books, and tutorials out there, learning how to program is well within reach for $2,000 or less so why spend $60,000 or more on a 4-year college education?

I think there's something missing from all of this financial analysis of college, and HN commentor threatofrain, in a debate about online learning resources, gets at part of it:
I have a feeling a lot of why people fail with these online resources is motivation, and I think it's why Khan Academy, while it is a forward-looking treasure of education, has failed to achieve revolution despite dramatically improving the access of education.

A lot of people need external mechanisms to keep themselves motivated, such as parental pressure, peer pressure, shame, and so on. As soon as people leave college, most people never learn a tall order of knowledge ever again, and most people let their existing knowledge rapidly decay. And then they're going to tell you a story about how everything they learned in college is useless, and how jobs want something entirely different.
A lot of this rings true for me. If someone doesn't have the motivation to get through college classes, it's unlikely that they'll have the motivation to learn on their own with online videos and exercises. I've watched my fair share of both online videos and in-the-flesh lectures, and I have to say, I've been almost always disappointed with the online variety. While in theory we should be able to take the best speakers and record them giving great lectures to be distributed online for all to experience, that does not seem to be the case in practice.

There is a big difference between watching a video online and actually being there, and retina displays haven't bridged that chasm. Regardless of the display quality, it seems that most online videos are mediocre at best. It is true that the average professor wasn't that great when I was in college, either. When professors taught straight from the textbook, I found that I was better off studying the book outside of class and using the lectures as review to cement the material in my memory. In most cases the professors probably didn't have much choice since they couldn't depend on students reading the material outside of class, the professors had to teach out of the textbook. I cannot deny that these lectures were boring, but they still weren't as bad as the stuff I've found online.

While some professors weren't all that good, the best ones were beyond excellent. Taking a course from a great professor was an incredible learning experience, and there's no way that experience could be reproduced online. I had a modern physics professor who would explore the quantum and relativistic physics of modern technology, like scanning electron microscopes and GPS, interactively with the class. I had a computer science professor who led class discussions on cutting-edge microprocessor designs. I had a history of science professor who packed a 600-seat lecture hall every class because he was a force of nature, and the energy in the room during his lectures was palpable. These professors were not all that rare, either. I would have at least one out of four courses each semester that I thoroughly enjoyed with an excellent professor, and that rate went up as I got into the more advanced courses. Having personal access to these professors in and outside of class was even more of a benefit that would be impossible to reproduce online.

Great professors are but one resource that colleges have over the self-taught route. Some other ones are the facilities, the other students, and the general environment of the college. A college has all kinds of places to explore and learn, and you're constantly surrounded by other people that are also exploring and learning. The entire atmosphere is infectious and helps motivate you to keep on exploring in a way that would be difficult to maintain if you were on your own. Learning is hard enough; why make it harder by isolating yourself from other people that could help you along?

College also provides a relatively safe place to experiment with ideas and push yourself in ways that you normally wouldn't out in the real world. This aspect of college is something I wish I would have taken more advantage of at the time. I would have spent extra time exploring some of the more interesting course material and building personal projects to apply the things I was learning. I was of the mindset that learning the material and doing well on tests was all I needed to succeed in college, but that is a narrow view of what college offers.

That is not to say that I had much free time. I was a member of KHK, a professional electrical engineering fraternity, and had a year-round internship where I put in 20 hours a week during the school year. I took advantage of the college social life (another important part of college), and made some lasting friendships, including my future wife. These are all incredibly valuable things to me, and I would have missed out on them if I hadn't gone to college. The cost of tuition and my post-graduation salary are almost secondary to the non-financial value I got out of college. I can't even put a price on most of the experiences I had in those years.

Like everything in life, you get out of college what you put into it. You can't expect to pay tuition and have an education spoon-fed to you. That tuition payment creates an opportunity, nothing more. It's up to you to put in the time and effort to transform that opportunity into a set of experiences that will shape your future and that you'll remember for the rest of your life. That opportunity is worth every penny.