The Color of Language

The inventions, designs, and concepts developed during the 1970s still guide modern software development practices. They work. They are efficient. In the 21st century, our software must support multiple cultures and multiple languages, and bring a degree of flexibility the early pioneers did not design into their practices. How do we accommodate both?
When we start a new software project we “swing with a big hammer,” standing up structures and frameworks quickly. What we built resembled the native Oracle APEX environment dressed in the client’s colors, logos, and fonts. Through this initial framework, we, the development team, offered a vision. From here, a shared vision developed through shared work. We also needed to accommodate multiple languages: French, Dutch, and the English the developers depended on. We needed to comply with European data privacy laws, which are stricter than those in the United States. To build a robust application that accommodates the nuanced complexities our client requires to differentiate and support their business practices, we design speed and resilience into our system. Designing for speed also means avoiding techniques that rob us of speed.
Speed within a database environment involves honoring a series of rules called “Data Normalization.” A discussion about data normalization typically involves slides full of abstract rules about optimizing complex data within a database environment. Edgar “Ted” Codd, born in 1923, worked for IBM and invented the relational model for database management in 1969. He received the Turing Award in 1981. Fifty years later, his initial work has expanded. The initial data normalization steps I learned have grown, and some of the nomenclature has changed.
The amazing thing to me is that human beings invented relational databases. And human beings invented programming languages. Today, we argue about these topics, forgetting the humanity underpinning these technologies.
My favorite college professor, John Jungck, stood before his Bio 101 course each year to give a lecture called “There Are No Facts.” Unlike modern disputes about facts, he forced no political agenda. He challenged everyone in the audience to approach science and technology with an open mind. We must eschew assumptions. “Oh, you think two plus two equals four?” Then he would reach under the lectern and pour two liters of a clear, water-like fluid into a container holding two liters of another clear, water-like fluid. The total was a bit less than the expected four liters. A bit of chemistry happened, given that one of the fluids was not water. Bluntly put, he performed a parlor trick for us. As an illustration, it works. He encouraged students to be curious, skeptical, and to carry a bit of doubt when people get dogmatic about anything.
All of the technology we use to build and support software applications results from inventions and ideas that came from the minds of people. People like you and like me. We invented it all.
By honoring the process of normalizing data, we gain speed and maximize performance within relational databases. And I do promise: we will get to how to manage multiple spoken languages within a database.
In Episode 2, “Data Tables,” I mentioned that my colleague Dirk provided us with 132 data table definitions filling over 5000 lines of text. I discussed the importance of a unique primary key for each row, or record, of data. One row of data contains the data profile for precisely one subject. One example was the customer table. The customer table has a primary key called the customer_pk. All of the data within that row must relate to that exact customer. If it doesn’t relate to or describe an element of that customer’s profile, then it does not belong.
That statement of “fact,” which I surrounded with quotes, complies with the First Normal Form of data. Thank you, Doctor Codd. You documented this idea and stamped it with a name in 1970.
When a data row has a singular primary key, as our customer table does with the customer_pk, then it meets the standards for the Second Normal Form. Well done us. This seems obvious to software developers. It seems so obvious to some that we forget the intelligence and humanity behind this concept. 
Database tables relate to each other. We do not store each and every invoice within the customer table. Instead, we create a table that stores the invoice data. In that table, we’ll have a unique primary key called invoice_pk (meeting both the first and second normal forms – well done us). In the second position of that invoice table, we keep the foreign key for the customer table. That foreign key, our team calls customer_fk. I look at that knowing instantly that invoices relate to customers and the customer foreign key connects to the customer primary key. We break the data into logical and non-duplicating elements. 
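As a rough sketch, assuming Oracle 12c or later for identity columns and purely illustrative column names beyond the keys, the two tables might look something like this:

CREATE TABLE customer (
  customer_pk    NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,   -- one unique key per customer
  customer_name  VARCHAR2(200) NOT NULL
);

CREATE TABLE invoice (
  invoice_pk     NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,   -- one unique key per invoice
  customer_fk    NUMBER NOT NULL REFERENCES customer (customer_pk), -- foreign key back to the parent customer
  invoice_date   DATE DEFAULT SYSDATE NOT NULL,
  total_amount   NUMBER(12,2)
);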
For example, some may opt to put a customer’s address in the customer table. Here’s the street name, the municipality name, the state or province, and the postal code. Suddenly, we find a customer with two addresses. Or maybe one address is for mailing and the other is for shipping. Maybe one address is the physical address, and so on. We have all experienced this complexity. When I order from an online vendor, I have one postal code for our physical address and a separate postal code for our mailing address. When I want items shipped to our home, I use the postal code for a municipality that is 50 kilometers east of us. The credit card statement goes to a different postal code, which carries a different municipality name. Neither of these agrees with the emergency services (911/999) address for our farm. That is a third address. Yes, all three of these addresses land on our 40-hectare property clinging to the side of the world’s oldest mountain range.
To manage the complexity of addresses, we ought to have a table for addresses. The first column is reserved for the address primary key (yes, of course it is called address_pk). The second column is then the customer foreign key (customer_fk).
These relationships, and this effort to segregate data into parent/child structures, satisfy the third normal form. Customer is a parent to both address and invoice. In human-speak, one customer may have zero, one, or more addresses. One customer may have zero, one, or more invoices. The parent table may have zero, one, or more related child rows of data in other tables.
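To make “zero, one, or more” concrete, here is a hedged example query against the sketch above (assuming an address table with address_pk and customer_fk columns, as just described) that lists every customer with a count of its child address rows:

SELECT c.customer_pk,
       c.customer_name,
       COUNT(a.address_pk) AS address_count  -- zero for customers with no address rows
  FROM customer c
  LEFT JOIN address a
    ON a.customer_fk = c.customer_pk
 GROUP BY c.customer_pk, c.customer_name;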
I have always had difficulty telling a story about the 4th normal form of relational data. It is a subtle shift in the perception of data duplication. In the prior forms, we strove to eliminate the duplication of data within a row. In short, the normalization process discourages us from having fields such as Address 1, Address 2, Address 3, and Address 4 in our tables. We should pull that mess out to create an address table, then simplify the customer table. We want to remove these duplicate-like fields of Address 1, Address 2, Address 3. It makes our lives easier. Picture a customer with only one address. Now picture a customer with four addresses. How do we handle that?
In the 4th Normal Form, we strive to reduce redundancy between rows of data. In the second normal form, we don’t want to keep adding fields to accommodate new and more addresses for a customer. That’s inefficient and difficult to write. It brings more problems than it solves. Dr. Codd was right.
In the 4th Normal Form, we reduce the number of redundant rows by creating related tables. Imagine that each contact at a customer also included the customer’s address, the customer’s primary phone number, and such. In this example, five rows of contact data would have the same Customer Name. Five rows would have the same address. Five rows would have the same phone numbers. This can result in a table having compound keys. The most important data fields are duplicated. When searching for the contacts at Acme Company, I get five rows with the name Acme Company. I see five rows with the same phone number and five rows with the same address. Imagine that two of my contact people have the surname Gonzales. What if a father-and-son duo work there? We suddenly have Pablo Gonzales, the father, and Pablo Gonzales, the son. When we look at the data, do we have one person or two? And what is the key we need to find one person uniquely? Searching this contact table gets messy and complicated.
Sometimes developers use two keys to find data, or three keys to find data. We are forced to use compound keys to find data. This violates the 4th Data Normalization Form. We should “decompose” the data, that is, normalize the data.
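A hedged sketch of that decomposition, again with illustrative names: the contact table keeps only what belongs to the person, and everything about the company lives exactly once in the parent customer table.

CREATE TABLE contact (
  contact_pk    NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,    -- one simple key finds one person
  customer_fk   NUMBER NOT NULL REFERENCES customer (customer_pk),  -- company name, address, and phone stay on the parent row
  first_name    VARCHAR2(100),
  surname       VARCHAR2(100),
  direct_phone  VARCHAR2(40)
);

Now the two Pablo Gonzales rows are told apart by contact_pk alone, not by some fragile combination of name, phone, and address.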
Databases find data quickly because we index the data, which sorts it, often alphabetically. When we search for Pablo Gonzales, the database optimizes its efforts to know where “P” is located – obviously, I have oversimplified. The database does not search through every row of data looking for Pablo. It searches through a pre-sorted and stored abstraction of the data. Searching these pre-sorted data abstracts is incredibly fast. We win the speed game. Thank you, indices, we love you.
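As an illustration, assuming the contact table sketched above, one such index might be:

-- A sorted structure over surname and first name; the optimizer can
-- walk this index instead of scanning every row of the table.
CREATE INDEX contact_name_idx ON contact (surname, first_name);

SELECT contact_pk, first_name, surname
  FROM contact
 WHERE surname = 'Gonzales';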
These indices work only if the software developers honored the data normalization forms developed by Dr. Codd and his colleagues. We want to reduce the duplication of data by relying on the relational nature of data tables. We do not want to jump sideways in a table trying to find the customer’s address, especially through a bunch of fields named Address 1, Address 2, Address 3, Address 4. We do not want to fill our data tables with rows of redundant data. That bloats the data and causes performance problems with indices. Often when we create redundant data going vertically through a table, we then rely on more than one key to find data. Managing compound keys is a pain in the anatomy.
As this project progresses, we deliberately step away from these ideals. Every time, the hero of our story (yes, me) fusses and fumes about the potential mess of someone else’s design and suggestions. Most of the time, I lose. To be a tactful and polite human, I ought to say that I yielded my opinion for the greater harmony of the team. Nobody likes a dogmatic whiner saying: This is the right way. I don’t want to be a dogmatic whiner. I want to create the best, fastest, and most supportable software for clients. 
Well done working with me through the Boyce-Codd Normal Form. It will come up again as the project progresses. This segmentation and organization of data falls beyond the standard thought models that normal humans have. There’s the struggle.
Should I stand with a shield and spear at the castle gates defending the fundamentals of relational database design? Or not? I do not want to be that ass saying: “We’ve been doing it this way since 1970, and I don’t know why we should change now.” I hate that sentiment. I’ve gotten very tired of hearing it. I do not want to say it to another.
On the other hand, when we violate the rules, we degrade performance. We degrade the flexibility of our data. We degrade our ability to support the systems. We degrade our ability to expand systems.
We use the world’s premier database platform. That platform was invented and built around Dr. Codd’s data model. If you want the best, do it right. Not all applications require optimal data performance. Not all applications store millions of rows of data. Therefore, not all applications require a robust, world-class database environment like Oracle and its competitors.
If our decision is Oracle and Oracle’s database, then should we not endeavor to follow the four fundamental rules of how to organize data?
It would all be perfect if it were not for human beings. Like Plato’s ideal Forms, the ideal data normalization form often remains an objective instead of a reality.
In each decision, we face the classic cost-benefit analysis. Our storage of the French, Dutch, and English data violates the 3rd Normalization Form. Adding German, Spanish, or Italian will be complicated and expensive. On the other hand, we saved hundreds of hours. The queries we write to look up these data are 99% the same. The consistency pays off. We reduce the risk of errors. We reduce the need to slow down and refer to a table to discover what we called the French name field in this table. It is always the same.
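As a hedged illustration of what one of those lookup tables might look like (the table and column names here are placeholders, not the project’s actual definitions), each row carries its name in all three languages side by side:

CREATE TABLE shipping_method (
  shipping_method_pk  NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name_en             VARCHAR2(200) NOT NULL,   -- English name
  name_fr             VARCHAR2(200) NOT NULL,   -- French name
  name_nl             VARCHAR2(200) NOT NULL,   -- Dutch name
  display_order       NUMBER,
  active_yn           VARCHAR2(1) DEFAULT 'Y'   -- is this row still in use?
);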
In my text editor, Notepad++, I keep these queries so I can copy and paste them. I don’t want to type them again. The query reads like this:

Select
case
when the user’s language preference is Dutch, then show the Dutch name
when the user’s language preference is French, then show the French name
else, show the English name
End, with an alias of Name
The second field is the table’s primary key
From my lookup table
Where active is yes
Order the results by the display order, then alphabetize by the first field

It is a bit awkward to read an Oracle Select statement aloud on a podcast without a slide or printed text. My intent is to demonstrate the consistency we gained by violating the 3rd Normalization Form.
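For readers, here is a hedged rendering of that same statement in SQL, using the illustrative shipping_method table sketched earlier; the bind variable holding the user’s language preference is a placeholder, not the application’s actual item name:

SELECT CASE
         WHEN :APP_USER_LANG = 'nl' THEN name_nl  -- Dutch preference shows the Dutch name
         WHEN :APP_USER_LANG = 'fr' THEN name_fr  -- French preference shows the French name
         ELSE name_en                             -- everyone else sees English
       END AS name,
       shipping_method_pk
  FROM shipping_method
 WHERE active_yn = 'Y'
 ORDER BY display_order, name;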
Together, our team assumed we would only and forever support three languages. This decision gained us speed and consistency. Instead of risking inefficiency, we gained efficiency.
Look at this from another perspective: could our client quickly and easily begin to offer services throughout the European Union? Not easily. That falls beyond our assumption.
How might one do this?
Option 1: We create a new row in each table for each language we support. Here is the By Post row in German. Here is the By Post row in Spanish: “Por correo postal”. Here is the By Post row in Italian: “Per posta”.
Option 2: We use a separate table to store the name fields in the various languages and keep the other fields in the parent row.
Option 3: We get super esoteric and use the rarely discussed 6th Data Normalization Form, developed in 2003. So esoteric that I’ll gloss over it. It is not used often. Each data table has a primary key and one field. Very odd.
Option 4: We jump free of the constraints placed on us in the 1970s. The tables have one field called Name. We then store the language we want in the JSON data format. JSON is a text-based data structure that we can store within a field. Oracle can query it well and fast.
My first instinct, the first solution I argued for and described as Option 1, satisfies the 3rd Normalization Form then immediately violates the 4th Normalization Form. Options two and three are rather stupid and inefficient when writing code and developing APEX web pages.
In Option 4, I suggested we could store the various languages within a JSON data structure. It is possible and clever. It satisfies the common data normalization rules from the 20th Century while embracing 21st Century technology common to the internet. 
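A hedged sketch of Option 4, assuming Oracle’s built-in JSON support (12c or later) and illustrative names; the translations in the comment are examples, not the project’s data:

CREATE TABLE shipping_method_v2 (
  shipping_method_pk  NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name                VARCHAR2(4000) CHECK (name IS JSON),  -- e.g. {"en":"By Post","fr":"Par la poste","nl":"Per post"}
  display_order       NUMBER,
  active_yn           VARCHAR2(1) DEFAULT 'Y'
);

-- Pull the French name; adding German later means adding a key to the JSON,
-- not a column to the table.
SELECT JSON_VALUE(name, '$.fr') AS name
  FROM shipping_method_v2
 WHERE active_yn = 'Y'
 ORDER BY display_order;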
We did not do it this way. But instantly I could picture the time and financial investment needed to take the application from regional to global. Too cool, huh? That’s the sort of flexibility we need to introduce into software. Some of you listeners may remember or may have experienced the Y2K shenanigans. It resulted from a pair of flawed assumptions. Database developers working during the middle part of the 20th Century assumed their software would never last fifty, or forty, or even thirty years. Database application developers made this assumption in some cases. In other cases, the companies that developed programming languages and database environments decided to save space by ignoring the “century” portion of a date. Just a dumb decision when inventing stuff.
Sometimes, we learn these lessons. Sometimes, we fail to learn them and repeat them.  
Duplicating data to accommodate the languages of Europe or of the UN fell beyond the assumptions Dr. Codd made with the data normalization models he created. We failed to anticipate that we might want to use the same software and the same data in multiple languages simultaneously.
This episode focuses on assumptions and their long-term impact on technology and on projects such as our Electrotest endeavor. I also explore displaying the client’s desired font.
Browsers do not inherently know about fonts. Browsers must look to an external library of definitions to know how to shape a letter within a font. Electrotest provided us with a brand book informing us of their color palette and their corporate font selection.
I looked within the common font libraries for the exact font. I then selected the next closest, which Google offers as a Google Font. I deployed the font with a quick line in the header of our HTML linking to Google as the remote library for the font. It looked good. It worked well. And apparently, the practice violated European data privacy law, the GDPR (General Data Protection Regulation). Someone observed that when one links to Google for services such as fonts, Google gains insights into your web application. Quite literally all they need: who, where, what browser, how long a user is connected, what pages the user touched. Google uses the free font reference library as a way of capturing more data, which they then use to track us and monetize our behavior.
I assume the same is true when you log into an application with your Google or Facebook account. Ain’t nothin’ free. Yes, using Google’s font came easily and at no charge. But Google does not do this for free, nor does it provide these libraries for the public’s good. They cover their costs and make money from their actions. Similarly, if you use a Google login or a Facebook login (or Microsoft, or Apple, or Amazon), then your actions are fully visible to these vendors. They know what you buy, where you were when you made the purchase, and what you searched for. They know it all.
To comply with GDPR, we had to download the font, host it on our own servers, and keep it all internal and private. It costs our team and the client a few ducats, or euros, each month to comply. That’s OK. Well done us. We understood and followed the law (well, not me; I had to be taught about GDPR). Now I am a fan. I’d like to see similar laws here in the US.
During this episode, we explored the impact of the assumptions we make. We explored historical assumptions and guidance about how data ought to be structured, then I presented exceptions to these “rules.” When developing complex systems, we must find ways through complex challenges.


 
