I want to talk about data modeling. But not at the high level of performance, scalability, relationships, etc. Instead, I would like to discuss how to model some specific attributes that seem obvious at first but, in my experience, bring unexpected problems.
For example, you have the Product entity; is the price a number? Or you have a Job entity; is the status just a string?
Using a primitive value (number, text) for these attributes can cause issues later. What if you want to sell the product in different countries and the price attribute is just a number? Or what if each Job status must belong to exactly one workflow and all you have is a text value?
The examples of price and status show how some attributes that seem just a primitive value encapsulate a whole concept. Price encapsulates currency, country, even some dates. Status can be part of a workflow; it encapsulates time, users, next steps. A status means more than just some text.
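To make the price case concrete, here is a minimal sketch, assuming a hypothetical `Price` type that pairs the amount with a currency code. The class name, its fields, and the rule that mixed currencies cannot be added are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Price:
    """Hypothetical value object: an amount is meaningless without its currency."""
    amount: Decimal
    currency: str  # assumed to be a currency code such as "EUR" or "USD"

    def __add__(self, other: "Price") -> "Price":
        # Refuse to add prices in different currencies instead of silently summing numbers.
        if self.currency != other.currency:
            raise ValueError(f"cannot add {self.currency} to {other.currency}")
        return Price(self.amount + other.amount, self.currency)


# Two prices in the same currency can be added safely.
eur_total = Price(Decimal("19.99"), "EUR") + Price(Decimal("5.01"), "EUR")
```

With a bare number, adding a price in euros to a price in dollars would succeed and produce a meaningless total; here the mistake surfaces as an error.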
Overusing primitive values is known as “Primitive Obsession”, and it applies not only to data modeling in databases but also to our code and our types.
How do we identify these attributes and types? How do we know when to use a simple primitive value like a number or text versus creating a whole collection of attributes for it?
I propose a few questions to help identify when we are dealing with a concept rather than a primitive value.
If two attributes have the same value, are they equal?
If two job statuses have the same value, does that mean it’s the same status?
No. Statuses can belong to different workflows or to different users, so even when they have the same value they might mean different things.
Another example is the amount attribute. If we have two amounts of “100”, do they mean the same? No, because one can be 100 units while the other is 100 grams.
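The status question above can be sketched in code. This is only an illustration, assuming a hypothetical `JobStatus` type whose identity includes the workflow it belongs to; the workflow names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobStatus:
    """Hypothetical status: the label only has meaning inside its workflow."""
    workflow: str  # illustrative names such as "hiring" or "printing"
    value: str


hiring_open = JobStatus(workflow="hiring", value="open")
printing_open = JobStatus(workflow="printing", value="open")

# As plain strings the two statuses look identical...
same_text = hiring_open.value == printing_open.value
# ...but as concepts they differ, because the workflow is part of the comparison.
same_status = hiring_open == printing_open
```

The same idea applies to amounts: pairing the number with its unit makes 100 units and 100 grams compare as different values.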
Can we have more than one value in that attribute? What differentiates them?
Can a user have more than one phone number? What differentiates them? One common distinction is the home phone versus the work phone. A phone then carries more than just the number; it also has a type. Therefore, the phone is not just a number.
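A minimal sketch of the phone example, assuming hypothetical `Phone`, `PhoneType`, and `User` types. The helper `phone_of_type` is an invented name used only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PhoneType(Enum):
    HOME = "home"
    WORK = "work"


@dataclass(frozen=True)
class Phone:
    number: str
    type: PhoneType  # what differentiates one phone from another


@dataclass
class User:
    name: str
    phones: List[Phone] = field(default_factory=list)

    def phone_of_type(self, phone_type: PhoneType) -> Optional[Phone]:
        # Return the first phone of the requested type, or None if there is none.
        return next((p for p in self.phones if p.type is phone_type), None)


user = User("Ada", [Phone("555-0100", PhoneType.HOME), Phone("555-0199", PhoneType.WORK)])
work = user.phone_of_type(PhoneType.WORK)
```

Had the phone been a single text field on the user, adding a second number would require a schema change or an ad hoc convention inside the string.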
Can we identify one value with an “id”?
Take the example of a subcategory in a product. We might have the same subcategory belonging to different families or categories. In that case, it means that even though they have the same name, they are not the same, and we might want to identify each subcategory with an identifier.
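As a rough sketch of that idea, assume a hypothetical `Subcategory` type with a surrogate `id`; the category and product names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subcategory:
    id: int        # hypothetical surrogate key
    name: str
    category: str  # the parent category this subcategory belongs to


accessories_phones = Subcategory(id=1, name="Accessories", category="Phones")
accessories_cameras = Subcategory(id=2, name="Accessories", category="Cameras")

# Products reference the subcategory by id, so the shared name is never ambiguous.
product_subcategory = {
    "phone case": accessories_phones.id,
    "lens cap": accessories_cameras.id,
}
```

Both subcategories are called "Accessories", yet each product points at exactly one of them.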
If the value changes, does it still represent the same concept?
For example, can a tag be renamed and keep the same references? If we consider tags as belonging to a user and editable, renaming a tag means changing the name of an entity. But if we stored the tag as plain text, how do we know which tag to rename, or which tag belongs to which user?
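The tag case can be sketched as follows, assuming a hypothetical `Tag` entity with an `id` and an `owner`, and notes that store tag ids instead of tag text. All names and the in-memory dictionaries are illustrative stand-ins for real storage.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    id: int     # hypothetical identifier; stays stable across renames
    owner: str  # the user the tag belongs to
    name: str


tags = {
    1: Tag(id=1, owner="alice", name="todo"),
    2: Tag(id=2, owner="bob", name="todo"),
}
# Notes store tag ids, not tag text.
note_tags = {"alice-note": [1], "bob-note": [2]}


def rename_tag(tag_id: int, new_name: str) -> None:
    # Only the entity changes; every reference by id stays valid.
    tags[tag_id].name = new_name


rename_tag(1, "backlog")
```

After the rename, the note owned by alice resolves to "backlog" while the tag owned by bob is still "todo", even though both started with the same text.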
These questions are not an exhaustive list. Yet they are a good starting point to identify and avoid future problems when modeling data.
Do Not Overengineer
I am not proposing that you always replace a primitive with a collection of values; attributes like “first name” or age, for example, are usually fine as they are. But, just like everything in life, you need to find a balance.
Find a balance between maintainability and simplicity, complexity and completeness.
The balance is context-dependent, so always consider the project when modeling data. Finding the balance is not easy, and two different projects might have different balances.
Think twice before using a primitive value.
Don't hesitate to reach out to me if you have any questions or see an error. I highly appreciate it.
And thanks to JC and Sebastià for reviewing this article 🙏
Thanks for reading, don't be a stranger 👋
GIMTEC is the newsletter I wish I had earlier in my software engineering career.