In geospatial queries, we often need to quickly find all the points of interest (POIs) within a certain distance of an anchor point. In this post, we present a simple method that scales well to billions of data points and is implemented in plain SQL, so it can be deployed on massive data processing systems like Redshift or Hive/SparkSQL on Hadoop without any geospatial support components.
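To give a feel for what a radius query in plain SQL can look like, here is a minimal sketch. It is not necessarily the method from the post. It assumes a hypothetical `poi` table with `name`, `lat` and `lon` columns, a bounding-box pre-filter, and an equirectangular distance approximation, which is reasonable for small radii away from the poles and the antimeridian. The function names and the `KM_PER_DEG` constant are made up for illustration. The query runs through Python's `sqlite3` only to keep the example self-contained; the SQL itself uses nothing but arithmetic and comparisons.

```python
import math
import sqlite3

# Assumption: roughly 111.32 km per degree of latitude (spherical Earth).
KM_PER_DEG = 111.32

# Plain SQL: a cheap bounding-box filter first, then a distance check on
# the survivors. No geospatial functions are needed.
RADIUS_QUERY = """
    SELECT name
    FROM (
        SELECT name,
               ((lat - :lat0) * :k) * ((lat - :lat0) * :k)
             + ((lon - :lon0) * :k * :cos0) * ((lon - :lon0) * :k * :cos0) AS d2
        FROM poi
        WHERE lat BETWEEN :lat0 - :dlat AND :lat0 + :dlat
          AND lon BETWEEN :lon0 - :dlon AND :lon0 + :dlon
    ) AS candidates
    WHERE d2 <= :r * :r
    ORDER BY d2
"""


def build_demo_db():
    """Create an in-memory database with a few hypothetical POIs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE poi (name TEXT, lat REAL, lon REAL)")
    conn.executemany(
        "INSERT INTO poi VALUES (?, ?, ?)",
        [
            ("near", 34.06, -118.25),
            ("mid", 34.05, -118.20),
            ("far", 34.50, -118.25),
        ],
    )
    return conn


def pois_within(conn, lat0, lon0, radius_km):
    """Return names of POIs within radius_km of the anchor, nearest first."""
    # cos(latitude) shrinks the east-west size of a degree of longitude.
    cos0 = math.cos(math.radians(lat0))
    params = {
        "lat0": lat0,
        "lon0": lon0,
        "k": KM_PER_DEG,
        "cos0": cos0,
        "r": radius_km,
        # Half-widths of the bounding box, in degrees.
        "dlat": radius_km / KM_PER_DEG,
        "dlon": radius_km / (KM_PER_DEG * cos0),
    }
    return [row[0] for row in conn.execute(RADIUS_QUERY, params)]
```

Calling `pois_within(build_demo_db(), 34.05, -118.25, 5.0)` keeps the two nearby rows and drops the one about 50 km away. On a large table, the bounding-box predicate is the part that lets the engine skip most rows before any distance arithmetic happens.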
Hadoop 2.x replaces the previous web UI with a more detailed ResourceManager interface. After browsing the simpler JobTracker UI of Hadoop 1.x using lynx on the master node, finding things in the new interface took a bit of experimentation.
At Thinknear, we always want to make sure we are using the right tool for the job. So when Redshift came out, we decided to evaluate our reporting and analytics pipeline and see whether Redshift could help us improve it. At the time, we were using Hive/Hadoop on EMR for all our reporting and analytics. We saw Redshift as a way to speed up our reporting infrastructure without completely rearchitecting it, and to give our business team a much easier way to access the data. Given these goals, we evaluated Redshift against our existing Hive/Hadoop solution and found the following pros and cons.