09 January 2020

ShipBasher Development Log 2: Parsimonious Lexicography

This post is out of order because I wrote it and then wasn't entirely satisfied with what I had written, so I left it as a draft for months... Anyway, I left off Part 1 in the midst of a quandary as to what strategy to employ to accomplish my goal of a basic saving and loading system for ships and, eventually, other game files. To, amusingly, paraphrase the infamous Dennis Prager, it's a simple problem to explain but a very complicated one to solve. I needed to teach my computer to do the following:
  • Convert a ship into serialized data in the form of text
  • Write that text to a file
  • Read the text back from the file and build a ship based on it. If this all completes properly, that ship will be identical to the ship with which I started.
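The round trip described above can be sketched in a few lines. The actual project is written in C# for Unity; this is a deliberately tiny Python sketch, and every name in it (`serialize`, `deserialize`, the "key value" line format) is hypothetical rather than what ShipBasher actually uses:

```python
# Minimal round-trip sketch: serialize a "ship" to text, write it to
# a file, read it back, and rebuild it. All names are hypothetical.

def serialize(ship: dict) -> str:
    # one "key value" pair per line
    return "\n".join(f"{k} {v}" for k, v in ship.items())

def deserialize(text: str) -> dict:
    ship = {}
    for line in text.splitlines():
        key, value = line.split(" ", 1)
        ship[key] = value
    return ship

ship = {"entityName": "The Friend Ship", "hull": "cruiser"}
with open("ship.txt", "w") as f:
    f.write(serialize(ship))
with open("ship.txt") as f:
    loaded = deserialize(f.read())
assert loaded == ship  # the rebuilt ship matches the original
```

The point of the exercise is exactly that final assertion: whatever format one picks, save followed by load must be the identity function.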
My choices were to use XML as I had before, use Unity's built-in "JsonUtility" class, import and use some third-party implementation of JSON or some other format, or to homebrew my own system from scratch. Naturally, being an overeducated amateur programmer, I opted for the most difficult of these solutions. Every seasoned programming professional and h4x0r knows that the proper thing to do here is nab something from GitHub or at the very least consult StackOverflow, but nooo, I'm an "overachiever," so without further ado here's my story.
I did a bunch of reading online about what was involved in the science of saving and loading, during which I learned of erudite terms such as "lexer," "parser," "lexerless parser," etc. At first the complexity scared me off and I dabbled in wrangling Unity's JsonUtility, but shortly thereafter I gave up due to the lack of control I had over how it worked - issues such as wanting to serialize some fields in an object but not others, how to serialize compound types such as vectors, etc. It turned out that what I wanted to make was actually two things, corresponding to two bullet points rather than one:
  • Read text from a file and organize it into usable pieces (often called tokens)
  • Understand what the tokens mean and assemble a ship based on them.
These two tasks correspond respectively to the duties of a lexer and a parser. I thus proceeded to get my first lexer up and running:
As seen here, what I had the system do was grab one letter from the file at a time and treat it differently based on a few flags, effectively creating a Finite State Machine. Look at me, using big professional words! Obviously what comes next is over-simplified refresher #45890149021 on what a Finite State Machine is:
A Finite State Machine is a thing that can have any one of a limited (finite) number of statuses (the states) and shift between them based on things that happen to it, including its own actions. Oft-cited examples include vending machines, whose states include "idle," from which inserting a coin shifts it into "waiting for selection," from which pushing a button shifts it into "dispensing," from which it shifts itself back to "idle" by moving some of its mechanical bits around in such a way as to (theoretically) give the user one of its items (even better, the item the user wanted).
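The vending machine above can itself be written down as a tiny state machine: a current state plus a table of transitions. This is a Python sketch for illustration only (the state and event names are made up to match the example):

```python
# A toy vending machine as a finite state machine: the transition
# table maps (current state, event) -> next state.

TRANSITIONS = {
    ("idle", "insert_coin"): "waiting_for_selection",
    ("waiting_for_selection", "push_button"): "dispensing",
    ("dispensing", "item_delivered"): "idle",
}

def step(state: str, event: str) -> str:
    # events with no entry in the table leave the state unchanged
    return TRANSITIONS.get((state, event), state)

state = "idle"
for event in ["insert_coin", "push_button", "item_delivered"]:
    state = step(state, event)
assert state == "idle"  # full cycle returns the machine to idle
```

A lexer is the same idea, just with characters as the events and flags encoding the state.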
My lexer has around ten binary flags now, which in principle means up to 2^10 = 1024 possible flag combinations (depending on how granularly one wants to define a state), so I won't be enumerating them all here, but here's an oversimplified interpretation of what it does:
  • Be in one of several states based on how the developer or user set it up or based on what has happened already. These are determined by which flags are set or unset, as detailed below.
  • Grab a character (a letter, a number, a symbol such as "{", or a whitespace character such as a tab or a line ending) from the file.
  • Change one or more flags if the character is special in some way. For example, if the character is a whitespace character and the "inQuotes" flag is false, then the current token (word, phrase, numeric value, whatever) is complete. If that token is a "key" such as "entityName," then the next token is expected to be a "value" and not a key; if it is a value such as "The Friend Ship," then the next token is expected to be a key and not a value. In either case "isKey" should be toggled... unless a flag such as "inData" is true, indicating that multiple values follow the most recent key, e.g. "0.5", "1", and "9.5" all follow and correspond to "position."
  • If the character is not special, i.e. it's just a letter, number, or punctuation or symbol that isn't one of the symbols I consider "special" for these purposes (for example a period holds no special significance to the lexer), add it to a string (technically I used a StringBuilder, but that's all, well, very technical) representing the current token. As detailed above, the current token will build up until a whitespace character is encountered, unless "inQuotes" is true, etc. etc.
  • When a token is completed, i.e. a whitespace character shows up as described above (wow such non-linearity much complexity doge), the current token isn't empty, and no flag is set that allows treating empty tokens as actual tokens under certain conditions (yes, there actually is one), the token is sent to the parser to be processed along with a modicum of data about the current state machine flags.
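The steps above can be condensed into a toy lexer. Again, the real thing is C# with many more flags ("inData" and friends are omitted entirely here); this Python sketch keeps only "inQuotes" and "isKey", and its token format is my own invention for the example:

```python
def lex(text):
    """Tiny flag-based lexer sketch: split text into (kind, token)
    pairs, honoring quotes and alternating between key and value."""
    tokens = []
    current = []          # characters of the token being built
    in_quotes = False     # inside a quoted string?
    is_key = True         # is the next completed token a key?
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes      # toggle quote state
        elif ch.isspace() and not in_quotes:
            if current:                    # ignore empty tokens
                tokens.append(("key" if is_key else "value",
                               "".join(current)))
                is_key = not is_key        # keys and values alternate
                current = []
        else:
            current.append(ch)             # ordinary character
    if current:                            # flush the final token
        tokens.append(("key" if is_key else "value", "".join(current)))
    return tokens

print(lex('entityName "The Friend Ship" mass 42'))
# [('key', 'entityName'), ('value', 'The Friend Ship'),
#  ('key', 'mass'), ('value', '42')]
```

Even this stripped-down version shows why the flags matter: the space inside "The Friend Ship" ends no token, because "inQuotes" is true when it is read.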
Despite the clumsiness with which I bumbled about explaining all that, I did make all of the code work in a very controlled and predictable manner, which is the important thing in situations like this. As can be seen above, the lexer is able to construct tokens for keys and for values and output these tokens for processing - in the above image, "processing" is simply displaying them on the screen with some formatting for clarity.

Since this has ended up somewhat longer than I'd imagined it would, the explanation of what the parser does shall be in another installment.
