Data sanitisation

17 May 2017

12:05 Learnt a new term today - data sanitization: converting attributes when they're not in the expected format. I was supposed to index ~80K orders, and the script ran for ~40 minutes. It indexed 32K, but not the rest. Due to some confusion about the db name, it initially looked like my script hadn't worked at all. Thanks to the senior dev's clarification, I felt relieved I don't write completely shitty code. The actual error: some fields expected a Date type but received a String. Also, Elasticsearch's bulk API had some inherent errors which it doesn't surface properly, so it looked like all the orders had been indexed - I didn't have a great grasp of Elasticsearch's concepts yet. The first task now is to sanitise the data so that all 80K orders can be indexed.
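A minimal sketch of why the bulk call "looked fine": Elasticsearch returns HTTP 200 for a bulk request even when individual documents fail, so you have to check the top-level `errors` flag and each item's status yourself. The response below is a hand-written stand-in for a real bulk response, not actual output from my script.

```python
# Hypothetical bulk response: the request as a whole "succeeded",
# but one document was rejected with a mapping error.
sample_response = {
    "errors": True,  # top-level flag: at least one item failed
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [created_at]"}}},
    ],
}

def failed_items(response):
    """Collect the docs that the bulk call silently rejected."""
    return [item["index"] for item in response["items"]
            if item["index"]["status"] >= 300]

bad = failed_items(sample_response)
print(len(bad))       # 1 - one order was never indexed despite the 200 OK
print(bad[0]["_id"])  # 2
```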

  • Either write a Mongo script to handle this.
  • Or handle this in code.

I’m not really feeling adventurous right now, so I’ll stick with handling this part in the code itself.
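A sketch of the "handle it in code" option: coerce string dates into real datetime objects before indexing. The field names and the date format here are assumptions for illustration, not the actual order schema.

```python
from datetime import datetime

# Hypothetical field names - the real schema is different
DATE_FIELDS = ("created_at", "updated_at")

def sanitise_order(order):
    """Return a copy of the order with string dates parsed to datetimes."""
    clean = dict(order)
    for field in DATE_FIELDS:
        value = clean.get(field)
        if isinstance(value, str):
            # ISO 8601 assumed; a real script would handle more formats
            clean[field] = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return clean

order = {"_id": "abc", "created_at": "2017-05-17T12:05:00", "total": 499}
clean = sanitise_order(order)
print(type(clean["created_at"]).__name__)  # datetime
```

Doing this in code (rather than a Mongo script) means every order gets normalised on its way to the index, without touching the source collection.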

Reminds me, I have to sanitise my own blog [and complete it as well], because it isn’t publishable yet.

Apparently, ids can be compared directly, since they’re just strings.
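For instance, Mongo-style ObjectIds serialised as 24-character hex strings compare fine as plain strings - the leading bytes encode a creation timestamp, so later ids even sort after earlier ones lexicographically. The ids below are made up for illustration.

```python
# Hypothetical ObjectId hex strings (24 chars); no parsing needed to compare
id_a = "5915f5b1e138231f1cd8ef01"
id_b = "5915f5c9e138231f1cd8ef02"

print(id_a == "5915f5b1e138231f1cd8ef01")  # True - equality is just string equality
print(id_a < id_b)                         # True - id_b was generated later
```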

20:49 Now, some of the orders were f*ed up, which literally added a day to my work. 2 magical queries [surprisingly written by me] were needed to unf**k the originally messed-up orders. However, NDA.

Even with Elasticsearch, the total space consumed by 80K orders was 292 MB!

Add that Rick and Morty video about causation. Watching random shit, and realising there can never be certainty about what came first. Did Rama build the stone bridge to Lanka, or was it already there, and whoever wrote the story was like, wow, I can use this to fool people.

Indexed 80,000 orders in 1.89 hours; seems the staging db has even more orders.
