Categories
Clean your input data well
Input data is the foundation of your spreadsheet, so pay serious attention to clean it properly. If possible, clean it in such a way that it becomes an automated process for the next time you use the same spreadsheet.
Check your input sources
Does the following situation sound familiar to you?
Your spreadsheet has several input sources, from different systems, and the format of this data is different for each system. The data received seems to be incomplete, and it also includes a lot of duplicates. One system uses points for decimals, the other uses commas. It contains complete tables that are of no use to you at all, and the date used is 11.03.2018, does this mean 3 November 2018 or 11 March 2018? So many issues and the deadline for your data analysis is tomorrow…
Most Excel users dislike the process of data cleansing and prefer to start their analysis straight away, but don’t underestimate the importance of clean input data! No matter how much pressure there is from your organization to deliver the analysis quickly.
Quick wins in the data set
Often there is a lot you can do in the dataset itself to prepare it for analysis: replace points with commas, delete lines, delete duplicates, etc. Please note: a manually modified dataset is very difficult to validate by controllers and accountants. The connection between the original input data and the modified “Excel source data” is a well-known frustration. So, when possible, clean your data using formulas and reserve separate tabs for the original source data (1 source, 1 tab) and the final, clean input data.
Do not import more than necessary
Do you get more information from your input source than you actually need? Don’t burden your file with it. Imported tables that you don’t use shouldn’t be included in your spreadsheet. If possible, run the export scripts of other systems in such a way that they don’t export useless, additional columns. At the very least, make sure that the tab with the clean input data does not contain the unnecessary data.
Are numbers recognized as numbers?
In some import files numbers are not recognized as such in Excel. Figures remain text fields, with all kinds of annoying consequences for more complex analyses. A one-time trick is to select the column in question and convert it to numbers via the menu Data-> text to data (don’t ask us why it works, but it works). You will have to do this trick over and over again and then you will have to deal with the verifiability issues again. Fortunately, there is the formula, =NUMBERVALUE(A2), which helps you indicate which symbol is used for decimals in the source.
Read more about this subject in the article Pay attention to units and number formats
How to handle empty fields
There are datasets in which, for example, a “main category” of something is listed in column A, but only if the main category is not equal to the main category of the row above. The human eye understands this, but Excel doesn’t. Fill these empty cells in the enriched data set with functions such as =IF(SOURCE!A3=”;A2;SOURCE!A3)
.
Unneeded spaces in text fields
Unneeded spaces in text fields might not seem important at a first glance, but imported text data in Excel often ends up with one or more spaces. If this happens inconsistently, you’ll have problems with search functions and pivot tables. The “=TRIM(SOURCE!A2)
” function is a powerful way to correct this. If you want to combine this with consistent capitalization, you can use “=PROPER(TRIM(SOURCE!A2))
”.
In summary
Devote time and attention to prepare clean input data by using formulas. This will save you a lot of misery later on. Garbage can still result in a great and clean analysis.