Search This Blog

Finding out if I've Improved as a Programmer: Part 3

I've been walking through one of my old college projects of a shell program in an effort to see how much I've improved as a programmer over the past however many years. So far this exercise has been pretty enlightening, and I hope to finish it up in this post. I've improved the formatting and added a better set of automated tests to the project, and I've made it part way through what originally was the only function in the simple shell program, main(), refactoring as I went. Now we're at the main loop of the program that reads in and executes the commands. I've tried to be as brutally honest as I can in assessing the faults of my old coding style, and I'll keep that up as we finish this out. Let's dive right in and see how to clean up the rest of the code. As before, you can follow along with my GitHub repo to see all of the refactorings in context.

Reading the Next Command to Execute


Right away when declaring the forever loop for this part of the code, I do a couple of odd things:
  // Main Loop
  while (TRUE) {
    printf("%s", PROMPT);
First, that comment is totally useless. It should be removed. Next, I declare TRUE as a static variable in the variable declarations of this function and assign it to 1. Then I only use it once right here. That's unnecessary. I could #include <stdbool.h>, but I prefer using the for (;;) method of declaring a forever loop, so we'll go with that instead. PROMPT is also declared as a static variable for some strange reason. Since main() is only called once, it makes no sense to declare PROMPT as static. It may still make sense to leave it as a variable because we could add a feature to change the prompt during run time, but that's getting ahead of ourselves. There's currently no such feature in this program, so I'm going to make PROMPT a define statement near the top of the file. That leaves the main loop looking like this:
  for (;;) {
    printf("%s", PROMPT);
The next block of code belongs in a function, which is obvious from the leading comment:
    // Read the command line and check for errors
    commandLine = (char *)malloc(LENGTH * sizeof(char));
    checkEOF = fgets(commandLine, LENGTH, input_src);
    printf("Command Line : %s", commandLine);
    rstrip(commandLine);

    if (checkEOF == NULL) {
      printf("\n");
      exit(0);
    }
    if (is_empty(commandLine))
      continue;
    if (strlen(commandLine) >= (unsigned)LENGTH - 1) {
      puts("\nInput exceeds valid command length.");
      puts("Input must be at most 512 characters.");
      while (strlen(fgets(commandLine, LENGTH, input_src)) >=
             (unsigned)LENGTH - 1)
        ;
      commandLine[0] = '\n';
    } // else if
In order to extract this function, we need to do a few things. First, LENGTH will now be used in main() and in the new function called get_command_line_from(). So that we don't have to declare it multiple times or pass it into the function, let's make it a define statement at the beginning of the file. We should also make the name more descriptive:
#define MAX_COMMAND_LENGTH 513
Next, commandLine and checkEOF need to be declared in the function. We can rename commandLine to command_line while we're at it. We also need to return command_line at the end of the function. Finally, we can no longer use continue in the function to skip to the next iteration of the forever loop. We'll need to return NULL instead and handle it where we call the function. The new function looks like this so far:
char *get_command_line_from(FILE *input_src) {
  char *command_line = (char *)malloc(MAX_COMMAND_LENGTH * sizeof(char));
  char *checkEOF = fgets(command_line, MAX_COMMAND_LENGTH, input_src);
  printf("Command Line : %s", command_line);
  rstrip(command_line);

  if (checkEOF == NULL) {
    printf("\n");
    exit(0);
  }
  if (is_empty(command_line))
    return NULL;
  if (strlen(command_line) >= (unsigned)MAX_COMMAND_LENGTH - 1) {
    puts("\nInput exceeds valid command length.");
    puts("Input must be at most 512 characters.");
    while (strlen(fgets(command_line, MAX_COMMAND_LENGTH, input_src)) >=
            (unsigned)MAX_COMMAND_LENGTH - 1)
      ;
    command_line[0] = '\n';
  }

  return command_line;
}
At the call site, we declare command_line and initialize it to the return value of the function. Then we check if it's NULL, and continue if it is to skip to the next iteration of the forever loop:
    char *command_line = get_command_line_from(input_src);
    if (command_line == NULL) continue;
Things are getting better, but while moving this code around you may have noticed that we have a memory leak hidden in there. Observe that command_line is getting allocated new memory every time through the loop, but that memory is never freed. We should free this memory at the end of the loop:
    free(command_line);
Now we should make sure the tests pass and do a commit because we're about to make a change that's going to change the output of the program. Why change the output? Take a look at the output of the program when you run the batch.txt file, and notice what happens when the line that exceeds the maximum length gets executed. Even though the shell says the line exceeds the valid command length, it still tries to execute whatever is left over at the end of the line after the maximum length blocks are read in. The reason that happens is in this new function, and we're going to fix it. But fixing it will change the output, so we want to make sure it's the only thing we do on this next commit. We're simply going to get rid of the line that sets the first character of command_line to a newline after the command line is read in, and instead we'll return NULL. Now the shell no longer tries to execute the end of the long line as a command, eliminating a bunch of erroneous output from the batch run. After we update the golden outputs for the tests, we can do another commit before moving on to refactor the rest of the function.

The next thing we should do with this function is reorganize it. The command line is read in from the input source, and then it is printed for debugging purposes and the trailing whitespace is stripped off before it is checked for errors. Really, it would be nice to not have to assign the output of the function to a variable to check for end-of-file. Then the fgets() call can happen right inside the if conditional. To do this, we can move the call to rstrip() down to just before the is_empty() check. Now, you may also notice a subtle bug in the code. Because the command_line is being modified before its length is being checked, we may not catch some command lines as being too long if they happen to have a whitespace on the 512th character. I could add a test for this bug, do another commit because the test output will have changed, fix the bug, and do yet another commit because the test output will have changed again, but it's an easy fix so I'm going to fix it and add the test later. To fix the bug, we need to move the rstrip() and is_empty() code after the length-checking code.

Then, the debug print statement can go right before the length-checking code, and the fgets() call can go into the if conditional. Lastly, I don't like how the MAX_COMMAND_LENGTH value is getting used. It's actually one more than the maximum command length for allocation and read purposes (because of the extra \0 to terminate the string), it's getting decremented for the length-checking, and it's still a magic number in the error print statement. Let's fix all of that by making it 512, explicitly adding one for allocation and reading, and making a formatted print statement. If you managed to follow all of that detailed explanation, great. The refactored function looks like this:
char *get_command_line_from(FILE *input_src) {
  char *command_line = (char *)malloc((MAX_COMMAND_LENGTH + 1) * sizeof(char));
  if (fgets(command_line, (MAX_COMMAND_LENGTH + 1), input_src) == NULL) {
    printf("\n");
    exit(0);
  }

  printf("Command Line : %s", command_line);
  if (strlen(command_line) >= (unsigned)MAX_COMMAND_LENGTH) {
    puts("\nInput exceeds valid command length.");
    printf("Input must be at most %d characters.\n", MAX_COMMAND_LENGTH);
    while (strlen(fgets(command_line, (MAX_COMMAND_LENGTH + 1), input_src)) >=
            (unsigned)MAX_COMMAND_LENGTH) ;
    return NULL;
  }

  rstrip(command_line);
  if (is_empty(command_line)) return NULL;

  return command_line;
}
I also brought the return NULL to the same line as the is_empty() check because that was just waiting for a bug to be introduced. Control statements should be either one-liners or the body should be wrapped in braces so when more lines are added to the body, it's clear that they are all part of the body of the control statement. After committing this code, we can add another test with the command line too long and a space at the 512th character. Then we can update the tests again and commit the new test.

Parsing the Command Line


The next block of code parses the command line into a command structure that will be used to execute the command:
    command = (command_t *)malloc(sizeof(command));
    if (command == NULL) {
      fprintf(stderr, "ran out of mem\n");
      exit(EXIT_FAILURE);
    }

    // Parse the executable and arguments from the command
    command->argc = 0;
    ///    command->argv = (char **) malloc(3 * sizeof(char *));
    command->argv[0] = (char *)malloc(MAX_COMMAND_LENGTH * sizeof(char));
    strcpy(command->argv[0], strtok(command_line, " "));
    printf("   0 : %s\n", command->argv[0]);

    char *sTmp = strtok(NULL, " ");
    int cnt = 1;

    while (sTmp) {
      command->argv[cnt] = (char *)malloc((MAX_COMMAND_LENGTH * sizeof(char)));
      strcpy(command->argv[cnt], sTmp);

      printf("   %d : %s\n", cnt, command->argv[cnt]);

      cnt++;

      // do again.
      sTmp = strtok(NULL, " ");

    } // while

    command->argc = cnt;
    command->argv[cnt] = NULL;

    cnt = 0;
    while (command->argv[cnt]) {
      printf("  %d arg %s\n", cnt, command->argv[cnt]);
      cnt++;
    }

    // int temp = strlen(command->argv[command->argc-1])-1;
    // command->argv[command->argc-1][temp] = '\0';
    // command->argv[command->argc] = NULL;
    printf("%s\n", command->argv[0]);
    strcpy(command->name, command->argv[0]);
    printf("%s\n", command->name);
This code looks long, but it can be refactored in a similar way to the code that parses the shell path, and there's a lot of debug statements in there that will eventually be removed. Before we get into that, though, let's extract the function and remove the irrelevant commented-out code:
struct command_t *parse_command_line(char *command_line) {
  struct command_t *command = (command_t *)malloc(sizeof(command));
  if (command == NULL) {
    fprintf(stderr, "ran out of mem\n");
    exit(EXIT_FAILURE);
  }

  // Parse the executable and arguments from the command
  command->argc = 0;
  command->argv[0] = (char *)malloc(MAX_COMMAND_LENGTH * sizeof(char));
  strcpy(command->argv[0], strtok(command_line, " "));
  printf("   0 : %s\n", command->argv[0]);

  char *sTmp = strtok(NULL, " ");
  int cnt = 1;

  while (sTmp) {
    command->argv[cnt] = (char *)malloc((MAX_COMMAND_LENGTH * sizeof(char)));
    strcpy(command->argv[cnt], sTmp);

    printf("   %d : %s\n", cnt, command->argv[cnt]);

    cnt++;

    // do again.
    sTmp = strtok(NULL, " ");

  } // while

  command->argc = cnt;
  command->argv[cnt] = NULL;

  cnt = 0;
  while (command->argv[cnt]) {
    printf("  %d arg %s\n", cnt, command->argv[cnt]);
    cnt++;
  }

  printf("%s\n", command->argv[0]);
  strcpy(command->name, command->argv[0]);
  printf("%s\n", command->name);

  return command;
}
After changing the command assignment in main() to call parse_command_line() and making sure the tests still pass, we can start cleaning up this function. Let's start by looking at the parsing code after the first comment and up to the end of the first while loop. This is a verbose, convoluted way to use strtok(), and it shows again how I didn't really understand how to use it. There is no need to allocate memory for strings or copy strings here because strtok() inserts null terminators right into the string that's being tokenized. All I have to do is assign the command->argv pointers to the start of each token, and the rest takes care of itself. I can replace all of that code with a simple for loop (removing the useless comments):
  int cnt = 0;
  for (command->argv[cnt] = strtok(command_line, " ");
       command->argv[cnt];
       command->argv[cnt] = strtok(NULL, " ")) {
    printf("   %d : %s\n", cnt, command->argv[cnt]);
    cnt++;
  }
Assigning command->argv[cnt] to NULL after the loop is redundant, so it can be removed. Then, we can convert the next while loop into a for loop to save a couple more lines of code:
  command->argc = cnt;

  for (cnt = 0; command->argv[cnt]; cnt++) {
    printf("  %d arg %s\n", cnt, command->argv[cnt]);
  }
Things are getting better. Now, let's pause for a moment and talk about command_t. The way it's currently defined, command_t is an incredibly wasteful struct:
struct command_t {
  char name[1024];
  int argc;
  char *argv[1024];
};
First, name is redundant because it is an alias for argv[0], so it shouldn't be defined as a fixed-size array and really doesn't need to be there at all. Next, argv should not be defined as a fixed-size array either. We can easily figure out how many arguments there are and size the array appropriately. Finally, I have no idea why I made those two arrays the same arbitrary size of 1024. Laziness and incompetence, that's all I can come up with. We can fix all of those problems by replacing name with argv[0] and counting up the arguments before tokenizing to allocate the array for argv. The former is self-explanatory. The latter looks like this:
  command->argc = count_char(command_line, ' ') + 1;
  command->argv = (char **)malloc((command->argc + 1) * sizeof(command->argv));
  if (command->argv == NULL) {
    fprintf(stderr, "ERROR: Out of memory in parse_command_line() allocating command->argv\n");
    exit(EXIT_FAILURE);
  }
This code goes before the for loop that tokenizes the command line, and we can remove the command->argc assignment after the loop. The command_t struct also changes to this:
struct command_t {
  int argc;
  char **argv;
};
Notice that we're reusing the count_char() function that was defined for parsing the shell path. In this case counting up spaces could result in over-counting because there could be multiple spaces in a row that end up separating only two tokens, but the worst that happens is that a larger array gets allocated than is necessary. That's not a big deal because the command length is limited so the array couldn't get arbitrarily large. In fact, it will always be smaller than it was with the old struct definition if we continue to limit the command size to 512.

We should take care of one more thing with command. Like with command_line, we're allocating memory for it but not freeing it. We should add these lines at the end of the infinite loop in main():
    free(command->argv);
    free(command);
While we're down here, we can also delete all of the commented-out code that frees other memory. I probably was trying to fix some segmentation fault before and left the trial code lying around. We'll add in any other free() calls we need later, but these commented-out lines are just distracting right now. With that, parse_command_line() is looking pretty good. We can do another commit.

Now, I had said that parsing the command line was similar to parsing the shell path, but we're not really doing them the same way. The parse_shell_path() function breaks the shell_path string into tokens and then copies the tokens into an array of strings. The parse_command_line() function points each element of the array of strings directly at a token in the command_line. We can easily make the tokenizing part of parse_shell_path() look like that of parse_command_line() by duplicating shell_path first and then converting the tokenizing loop over to the new way, like so:
char **parse_shell_path() {
  char *shell_path = getenv("SHELLPATH");
  if (shell_path == NULL) return NULL;

  char *shell_path_dup = (char *)malloc((strlen(shell_path) + 1) * sizeof(char));
  if (shell_path_dup == NULL) {
    puts("ERROR: Out of memory in parse_shell_path() allocating shell_path_dup");
    exit(EXIT_FAILURE);
  }
  strcpy(shell_path_dup, shell_path);

  int path_count = count_char(shell_path_dup, ':') + 1;
  char **command_paths = (char **)malloc((path_count + 1) * sizeof(char *));
  if (command_paths == NULL) {
    puts("ERROR: Out of memory in parse_shell_path() allocating command_paths");
    exit(EXIT_FAILURE);
  }

  int i = 0;
  for (command_paths[i] = strtok(shell_path_dup, ":");
       command_paths[i] != NULL;
       command_paths[i] = strtok(NULL, ":")) {
    i++;
  }

  return command_paths;
}
This kind of thing happens a lot when refactoring. As code gets reorganized and cleaned up, optimizations discovered in one place can transfer to other places, resulting in more reuse and sudden improvements in how things are done. At this point it should be easy to extract a function that does the tokenizing for both strings, but the parse_command_line() code has an extra debug statement in it that's mucking things up. I want to put off changing the program output again until I can tear out all of the debug statements at once, so I'll leave this exercise until the final cleanup. For now, we'll commit things how they are and move on to the last section of code.

Executing Commands


The next section of code starts off by checking if the command is the exit command, exits if it is, and proceeds to execute the command if it's not:
    if (!strcmp(command->argv[0], "exit")) {

      if (command->argc == 1) {
        exit(0);
      } else {
        puts("\nexit has too many arguments.");
        puts("Type \"exit\" or press Ctrl-D to exit.\n");
        execute = 0;
      } // else
    }   // if

    if (execute) {
The first issue with this code is that the if conditional is not very clear in what it is doing. It looks like it's checking if the command is not "exit" because of the '!' operator, but it's actually checking for a match with "exit." Comparing against zero is better in this case. The other issue is that the execute variable is only used one, right in that if conditional in the last line in the code above. That's unnecessary because if the execution path leaves the first if statement without exiting the program, it definitely shouldn't execute the command. Therefore, we can make the second if statement an else instead and get rid of execute:
    if (strcmp(command->argv[0], "exit") == 0) {
      if (command->argc == 1) {
        exit(EXIT_SUCCESS);
      } else {
        puts("\nexit has too many arguments.");
        puts("Type \"exit\" or press Ctrl-D to exit.\n");
      }
    } else {
Let's extract the whole body of the following else branch into a function called execute_command() right away and do some light cleanup before going any further:
void execute_command(struct command_t *command, char **command_paths) {
  int i = 0;
  int pid, rc;
  int error = -1;

  rc = fork();
  if (rc == 0) {
    char *tempCmd = (char *)malloc((strlen(command->argv[0]) + 1) * sizeof(char));
    strcpy(tempCmd, command->argv[0]);
    while (error == -1) {
      printf("Doing execv %s\n", tempCmd);
      int cnt = 0;
      while (command->argv[cnt]) {
        printf("  %d %s\n", cnt, command->argv[cnt]);
        cnt++;
      }
      execv(tempCmd, command->argv);

      if (command_paths[i] != NULL) {
        tempCmd = (char *)realloc(tempCmd, (strlen(command_paths[i]) + 1) * sizeof(char));
        strcpy(tempCmd, command_paths[i]);
        strcat(tempCmd, "/");
        strcat(tempCmd, command->argv[0]);
        i++;
      } else {
        puts("\nCommand Not Found.");
        puts("Please check your path and filename.");
        puts("Note: programs needing user input cannot be executed with this shell.\n");
        exit(0);
      }
    }
  } else {
    pid = wait(&rc);
  }
}
The error and tempCmd variables were moved into the function, too. Those were the last remaining variables in the original variable declaration block at the beginning of main(). Without even focusing on it directly, we've eliminated that whole block of code as we've been improving other things. Refactoring tends to have those nice side effects. The next refactoring should be to rename the variable tempCmd because that's just a horrible name. The thing that it represents is not a temporary command; it's the full command to attempt to execute. Let's call it full_command.

Before getting into more refactorings, let's stop and take a minute to appreciate this function. This is the core functionality of the program, and the whole purpose of doing this project in the OS course I took. It demonstrates how to use fork() to spin off a new process and then use execv() to execute a file in that new process image. The execv() function has some interesting behavior. If it doesn't find the file it's supposed to execute, it returns -1. If it successfully executes the file, it doesn't return at all. Instead, the process it's running in completes and any code after the execv() call never executes in the child process. With that behavior in mind, this code is fairly decent, but I have a few minor improvements:
  • The variable i declaration should be moved down to the first while loop.
  • The variable rc and pid declarations should be moved down to their first (and only) assignments.
  • The variable error can be eliminated.
  • The command length can be stored in a variable where it is first calculated because it should be added to the path length in the realloc() call later on. This was a bug that never really showed itself in testing.
  • The first while loop can be converted into a for loop (see code below) that looks like it would be an infinite loop, but the child process will exit because of how execv() works.
  • The string concatenation lines can be combined into a single sprintf() call.
  • The exit(0) call should use the named constant EXIT_SUCCESS.
The revised function looks like this:

void execute_command(struct command_t *command, char **command_paths) {
  int rc = fork();
  if (rc == 0) {
    int command_length = strlen(command->argv[0]);
    char *full_command = (char *)malloc((command_length + 1) * sizeof(char));
    strcpy(full_command, command->argv[0]);

    for (int i = 0; ; i++) {
      printf("Doing execv %s\n", full_command);
      int cnt = 0;
      while (command->argv[cnt]) {
        printf("  %d %s\n", cnt, command->argv[cnt]);
        cnt++;
      }
      execv(full_command, command->argv);

      if (command_paths[i] != NULL) {
        int command_path_length = strlen(command_paths[i]) + command_length + 2;
        full_command = (char *)realloc(full_command, command_path_length * sizeof(char));
        sprintf(full_command, "%s/%s", command_paths[i], command->argv[0]);
      } else {
        puts("\nCommand Not Found.");
        puts("Please check your path and filename.");
        puts("Note: programs needing user input cannot be executed with this shell.\n");
        exit(EXIT_SUCCESS);
      }
    }
  } else {
    int pid = wait(&rc);
  }
}
With that refactoring the last extracted function looks nice and clean, except for the debug statements that we still have to take care of. But first, let's take a look at the new main():
int main(int argc, const char *argv[]) {
  FILE *input_src = get_input_source(argc, argv);
  char **command_paths = parse_shell_path();

  for (;;) {
    printf("%s", PROMPT);

    char *command_line = get_command_line_from(input_src);
    if (command_line == NULL) continue;
    echo_command_from(input_src, command_line);

    struct command_t *command = parse_command_line(command_line);

    if (strcmp(command->argv[0], "exit") == 0) {
      if (command->argc == 1) {
        exit(EXIT_SUCCESS);
      } else {
        puts("\nexit has too many arguments.");
        puts("Type \"exit\" or press Ctrl-D to exit.\n");
      }
    } else {
      execute_command(command, command_paths);
    }

    free(command_line);
    free(command->argv);
    free(command);
  }
}
Wow. It went from 222 lines down to 29 lines, and it is much easier to read and understand. You can clearly see the flow of tasks that the code performs by skimming this function. First, the program finds the input source from the command line arguments and parses the shell path. Then, it will continuously read in command lines, parse them, and execute them until there are no more or "exit" is reached. Simple and clear, just like it should be. Let's commit this code, and then there's a couple more cleanup tasks to do.

It's finally time to get rid of all of those debug statements and clean up the program's output. I won't bore you with the details. Each debug print line can just be deleted. After removing all of the debug statements, I noticed that the prompt was getting printed to stdout extra times. This is because the print statement that prints the prompt doesn't have a newline at the end of it, so the print buffer needs to be flushed with fflush(stdout) before executing the command to get the output to line up correctly. I made that change in this same commit. The golden test outputs were updated for the changed output as well.

Now we can clean up one last thing in the parsing code. Remember how tokenizing the shell path string and the command line string were slightly different because of the debug statement in parse_command_line()? Well, now they're the same, so we can extract the function tokenize() and use it in both parsing functions. Here's the function:
char **tokenize(char *str, const char *delim, int count) {
  char **tokens = (char **)malloc((count + 1) * sizeof(char *));
  if (tokens == NULL) {
    puts("ERROR: Out of memory in tokenize() allocating tokens");
    exit(EXIT_FAILURE);
  }

  int i = 0;
  for (tokens[i] = strtok(str, delim);
       tokens[i] != NULL;
       tokens[i] = strtok(NULL, delim)) {
    i++;
  }

  return tokens;
}
Here's how it's used in parse_shell_path():
  int path_count = count_char(shell_path_dup, ':') + 1;
  return tokenize(shell_path_dup, ":", path_count);
Here's how it's used in parse_command_line():
  command->argc = count_char(command_line, ' ') + 1;
  command->argv = tokenize(command_line, " ", command->argc);
  return command;
It's a nice little reuse example to end our refactoring exercise. After running the tests one more time and doing one last commit, I think we can call this program done.


The code has changed dramatically over these three long posts, and now it's hardly recognizable compared to the original monolithic main() function. Numerous bugs were fixed, the code is more understandable because it's broken up into independent ideas, the structure is drastically improved with a layered organization of tasks, large sections of code have been simplified and streamlined, and now there are tests(!) so changes can be made with less fear of breaking something inadvertently. I found this to be an enlightening and surprisingly satisfying exercise in taking a hard look at some of my old code and refreshing it with all of my current opinions and ideas on what clean code is. I would highly recommend digging up some small, old program and giving it a makeover. You'll get a great picture of how you've changed as a programmer. I'm sure there are things I could have done better, and in another fifteen years I'll probably look back on what I did here and see a host of new improvements that could be made. At least, I hope that's the case. That's how it should be if I continue to improve.

No comments:

Post a Comment