Geeks With Blogs
Josh Reuben
BigQuery QuickRef
Big Data keeps evolving. Stone Age Hadoop was a lot of Java boilerplate for defining HDFS access, Mappers & Reducers. This was superseded by Bronze Age Spark, which provided a succinct Scala unification of: ML pipelines, in-memory structured DataSets over RDDs via a SparkSession SQL API, and Distributed Streams. (Note: you can run such jobs easily in a dynamically scalable manner on Google Dataproc.) Technology keeps evolving - the Big-Iron Age has arrived in the form of Google Cloud Platform's SPARK KILLER ......
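For a sense of the boilerplate that Spark condenses into a one-line Scala expression, the Mapper / shuffle / Reducer word count can be sketched in plain Python. This is a toy single-machine stand-in for the pattern, not the Hadoop API, and the sample data is invented:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) pairs - the role of a Hadoop Mapper
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum all counts for one key - the role of a Hadoop Reducer
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle phase: group mapper output by key before reducing
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

data = ["spark over hadoop", "spark over hdfs"]
print(map_reduce(data))  # {'spark': 2, 'over': 2, 'hadoop': 1, 'hdfs': 1}
```

In Spark the same computation collapses to roughly `lines.flatMap(_.split(" ")).groupBy(identity).mapValues(_.size)` - the framework handles the shuffle and distribution.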

Posted On Thursday, December 15, 2016 5:33 AM

Hive - HQL query over MapReduce
Overview: Developed at Facebook, Hive is a SQL-like framework for data warehousing on top of MapReduce over HDFS. HiveQL converts a SQL query into a series of jobs for execution on a Hadoop cluster, and organizes HDFS data into tables - attaching structure. Schema on Read versus Schema on Write: Hive doesn't verify the data when it is loaded, but rather when a query is issued. Full-table scans are the norm, and a table update is achieved by transforming the data into a new table, since HDFS does not provide in-place file ......
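The schema-on-read idea can be sketched in a few lines of plain Python - an illustrative stand-in for the concept, not Hive's actual engine; the sample rows and schema below are invented:

```python
import csv
import io

# Raw data lands in storage as-is - nothing is verified at load time.
raw = "1,alice,2016\n2,bob,oops\n"  # 'oops' is a deliberately bad year value

schema = [("id", int), ("name", str), ("year", int)]

def query(raw_text):
    # The schema is applied only when the query runs; bad rows surface here,
    # not at load time - unlike a schema-on-write RDBMS, which would have
    # rejected 'oops' at INSERT time.
    for row in csv.reader(io.StringIO(raw_text)):
        try:
            yield {name: cast(val) for (name, cast), val in zip(schema, row)}
        except ValueError:
            continue  # skip rows the schema cannot interpret

print(list(query(raw)))  # only the row that fits the schema survives
```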

Posted On Tuesday, March 22, 2016 5:32 AM

Big Data File Format Zoo
Big Data has a plethora of data file formats - it's important to understand their strengths and weaknesses. Most explorers start out with some NoSQL-exported JSON data. However, specialized data structures are required, because putting each blob of binary data into its own file just doesn't scale across a distributed filesystem. TL;DR: Choose Parquet !!! Row-oriented file formats (Sequence, Avro) – best when a large number of columns of a single row are needed for processing at the same time; general-purpose, ......
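The row-versus-column trade-off can be sketched with two toy in-memory layouts - plain Python stand-ins for the on-disk formats, with invented sample records:

```python
# Row-oriented layout (Sequence/Avro style): each record's fields live together.
rows = [
    {"id": 1, "fmt": "avro",    "size": 10},
    {"id": 2, "fmt": "parquet", "size": 20},
    {"id": 3, "fmt": "orc",     "size": 30},
]

# Column-oriented layout (Parquet style): all values of one field live together.
columns = {
    "id":   [1, 2, 3],
    "fmt":  ["avro", "parquet", "orc"],
    "size": [10, 20, 30],
}

# Fetching one whole record favours the row layout - a single lookup ...
record = rows[1]

# ... while an aggregate over one field only touches a single contiguous
# list in the columnar layout, instead of walking every record:
total = sum(columns["size"])
print(record, total)  # {'id': 2, 'fmt': 'parquet', 'size': 20} 60
```

Analytical queries typically aggregate a few columns over many rows, which is why the columnar layout (and Parquet) wins for that workload.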

Posted On Wednesday, March 16, 2016 6:15 AM

Copyright © JoshReuben