A few days ago I received a dataset with a great many rows and columns. Columns were separated by tabs and records by newlines. I had to edit some of its fields and make several changes: remove duplicates, replace some words with others, sort the file by the first and second keys, remove some columns, and count the number of lines in the file.
This sounds difficult in Windows and for Windows users, but in Linux all of the above tasks are a piece of cake with a few commands that can do magic for you.
First, to remove duplicates from the file I used the ‘uniq’ command, like below:
$ uniq MyFile > NewFile && mv NewFile MyFile
In the above command, duplicate lines are removed and the result then replaces the original file; ‘NewFile’ acts as a temporary file. Note that ‘uniq’ only collapses duplicate lines that are adjacent, so the file should be sorted (or its duplicates already grouped) for all of them to be caught. Imagine one specific record has, for example, four instances: after applying the command, three instances are removed and one is kept. If instead you want to remove all four instances, see the following example:
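Since ‘MyFile’ is just a placeholder, here is a small self-contained sketch of the adjacency caveat: scattered duplicates survive a plain ‘uniq’ and are only removed once the file has been sorted.

```shell
# A small sample file: 'apple' appears twice, but not adjacently.
printf 'apple\nbanana\napple\n' > MyFile

# uniq alone only collapses adjacent duplicates, so all three lines survive.
uniq MyFile

# Sorting first groups the duplicates, so uniq can remove them.
sort MyFile | uniq > NewFile && mv NewFile MyFile
cat MyFile   # now contains: apple, banana
```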
$ uniq -u MyFile > NewFile && mv NewFile MyFile
The ‘-u’ parameter keeps only lines that were never repeated; in other words, every copy of a duplicated record is removed, so some records disappear entirely.
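A quick sketch of the difference (the file contents here are made up for illustration):

```shell
# Sorted sample: 'apple' is duplicated, 'banana' is not.
printf 'apple\napple\nbanana\n' > MyFile

# Without -u, one copy of each line is kept: apple, banana.
uniq MyFile

# With -u, only never-repeated lines survive: banana.
uniq -u MyFile
```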
To replace specific words with other words in the file, I used the ‘replace’ command like below:
$ replace OldWord NewWord < MyFile > NewFile && mv NewFile MyFile
In the above example, ‘OldWord’ is replaced by ‘NewWord’. If you have more than one word to change, simply run the command a few times with different parameters.
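One caveat: ‘replace’ is a utility that ships with MySQL and is not installed on every system. Where it is missing, the standard ‘sed’ command does the same job; the file contents and words below are placeholders:

```shell
printf 'the old cat and the old dog\n' > MyFile

# s/old/new/g substitutes every occurrence of 'old' on each line.
sed -e 's/old/new/g' MyFile > NewFile && mv NewFile MyFile
cat MyFile   # -> the new cat and the new dog
```

With several ‘-e’ expressions you can change more than one word in a single pass instead of invoking the command repeatedly.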
Now, to sort the file based on the first and second fields, I used the ‘sort’ command, as I explained in an earlier post, but with a few changes, like below:
$ sort -k 1 -k 2 MyFile > NewFile && mv NewFile MyFile
There is an issue with numerical fields: by default, the sort command orders them lexicographically, like below,
1, 11, 2, 25, 29, 3
To solve this, just add the ‘-g’ parameter after sort, as in the following example:
$ sort -g -k 1 -k 2 MyFile > NewFile && mv NewFile MyFile
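A minimal sketch of the difference, using a made-up one-column file of numbers:

```shell
printf '11\n2\n25\n1\n' > MyFile

# Default lexicographic order: 1, 11, 2, 25
sort MyFile

# General numeric order with -g: 1, 2, 11, 25
sort -g MyFile
```

As a side note, ‘-n’ gives plain numeric sorting and is usually faster than ‘-g’; also, writing the keys as ‘-k1,1 -k2,2’ limits each key to a single field, since ‘-k 1’ by itself means "from field 1 to the end of the line".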
Next is removing some columns from the file. For this I used the ‘cut’ command, like below:
$ cut -f 1,6,10,15 MyFile > NewFile && mv NewFile MyFile
In the above example, fields number 1, 6, 10, and 15 are extracted from the file and written to a new file, which is then renamed to the old file name, applying the change. If you do not redirect the output to a new file, those fields are displayed on screen. By default, ‘cut’ treats the tab character as the field delimiter, which matches this dataset.
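A small sketch with a made-up four-field record:

```shell
# One tab-separated record with four fields.
printf 'id\tname\tage\tcity\n' > MyFile

# Keep only the first and third fields.
cut -f 1,3 MyFile   # prints: id, age (tab-separated)
```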
The final stage was counting the number of lines in the file; for this task I used the ‘wc’ command, like the following:
$ wc -l MyFile
By default the command prints the number of lines, words, and bytes. To get just the line count, I used the ‘-l’ parameter.
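A quick illustration with a made-up two-line sample file:

```shell
printf 'one two\nthree\n' > MyFile

# Without options, wc reports lines, words, and bytes (here: 2, 3, 14).
wc MyFile

# -l narrows the output to the line count.
wc -l MyFile
```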
As I mentioned earlier, all of this was a piece of cake in Linux without writing a single line of code, just using the power of this free and open-source OS. In addition, these commands are not limited to text files; you can combine them with other commands as well. For instance, to count the number of files in a directory you could use the following combination of commands:
$ ls | wc -l
Finally, it is fun to redirect and pipe one command's results into another, and when you do, you will understand the power of this operating system even better.
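For instance, chaining the commands from this post counts the distinct lines of a file in one go (sample contents made up for illustration):

```shell
printf 'b\na\nb\nc\na\n' > MyFile

# sort groups the duplicates, uniq drops them, wc -l counts what is left.
sort MyFile | uniq | wc -l   # -> 3
```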
Send your ideas and information to firstname.lastname@example.org