People understand a spreadsheet. It is visual, the data is right there, and you can perform calculations right next to the data. For many scenarios this is the right tool to use.
Some people have even figured out ways to power their mobile applications off of using google sheets. (https://www.appsheet.com/apps-from-google-sheets-and-forms)
The scenario we will be exploring is the management of data within a business. There are big disadvantages to using spreadsheets to do this, which databases handle very well.
Spreadsheets typically top out at a million rows of data. Databases can store practically an unlimited number of rows of data. Spreadsheets can have multiple sheets which can allow for complex calculations between multiple sets of data, but these become difficult to manage visually. Databases again can support practically an unlimited number of tables and make working with all that data much more simple through SQL queries.
Keeping data accurate is very difficult. Imagine updating the following sheet:
Now Imagine if the city of San Francisco (SF) made the horrible choice to officially change its name to Frisco (F). Now, while we might disagree with this decision, when we go to update the spreadsheet we have to update the value in multiple places. This creates an opportunity for error. Perhaps we missed one.
In a spreadsheet we could do ctrl F or use the Find and Replace tool, but this might break down. For example, if the data is spread across multiple sheets, it would be easy to miss one. In databases using SQL and 3rd Normal Form(more on this later) we can make these updates easily and accurately.
Spreadsheets have a binary level of access, either you have access to the file or you do not. Databases allow for complex permission settings to not only prevent certain people from being able to see certain parts of the data, but also to encrypt data within the database. This allows for compliance with regulations like GDPR(General Data Protection Regulation) and also protects the data from being edited by the wrong people.
To access the data within a database you will need to use SQL. While this is a fairly simplistic programming language it is not as intuitive as using a spreadsheet. There is a learning curve here.
Another pain point with working with data in a database is getting access. In many companies the amount of people who can run queries is fairly limited. Typically you will need to request access from the data team or your IT group.
Data is structured differently than in a spreadsheet. It is stored in what’s called 3rd Normal Form. This means that the data you want is typically spread out across multiple tables which you will need to join together.
New jargon always takes time to sink in. These words will be used throughout the book and will help make conversations about data more clear.
Table, Field and Record all mean something very similar to common spreadsheet words. Their pairs are as follows:
Primary Key and Foreign Key are database specific concepts. Data in a database is spread out across multiple tables. Because of this, we need ways of bringing all that data back together. This is accomplished by using Primary and Foreign Keys. The Primary Key is the ID field in a table, that uniquely identifies each record.
The primary key is indicated here with PK_id:
Then in another table the primary key for the location table might exist as well, this is called a Foreign Key.
The Foreign Key is indicated here as FK_location_id:
Now we can join these two tables together using the Primary and Foreign Keys to make one big table:
If we zoom out to see these two tables in the same database, before they were joined, we can explain the last jargon word, Schema. Schema is the word used to describe all of the tables that exist within the database:
The tables in a schema are commonly represented by showing the table name at the top with all of the field names listed below:
Ok there is actually more new jargon and these require special attention since most of the errors in writing SQL statements will be related to data type issues. Data types specify what kind of data is stored and what kind of functions can be called on that data. Every field in a database has a data type associated with it.
VARCHAR - Character string having variable length (i.e. Words, letters and numbers, and sentences)
INTEGER - Whole Numbers
FLOAT - Numbers with decimal point values
BOOLEAN - True or False value
DATE - Date/Time value
There is one data type that is not defined for fields that can be used in any of these.
NULL - this means there is no value in the cell