When constructing Spark Machine Learning pipelines, I find it really helpful to maintain a bird's-eye view of the various transformers and estimators available.

In a nutshell: fit trainingData (train a model), transform testData (predict with the model).

Transformer: DataFrame => DataFrame
Estimator: DataFrame => Transformer

Transformers:
- Tokenizer: sentence => words
- RegexTokenizer: sentence => words - setPattern
- HashingTF: terms => feature vectors based on frequency - setNumFeatures
- StopWordsRemover: …
System architecture patterns:
- N-Tier
- Event-Driven - Mediator / Broker
- Microkernel
- Microservices
- MVC / MVP / MVVM
- Server - RPC / Remoting / WS / SOA / REST
- Space-Based

SOA patterns:
- Foundational / Structural:
  - Service Host - infra
  - Active Service - worker thread for upstream pre-fetch
  - Transactional Service
  - Workflow
  - Edge Component
- QoS patterns:
  - Decoupled Invocation - queues for reliability, bursts
  - Parallel Pipelines - steps -> throughput
  - Gridable Service
  - Service Instance - multiple stateless copies, NLB
  - Virtual …
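Of the QoS patterns listed, Decoupled Invocation is the easiest to show in miniature: callers enqueue requests and return immediately, while a worker drains the queue at its own pace, so bursts are absorbed by the queue instead of overwhelming the service. A stdlib-only sketch; the names (`worker`, `request_queue`) are illustrative.

```python
# Decoupled Invocation sketch: a queue sits between callers and the service.
import queue
import threading

request_queue = queue.Queue()
results = []

def worker():
    """Drain the queue at the service's own pace."""
    while True:
        request = request_queue.get()
        if request is None:              # sentinel: shut the worker down
            request_queue.task_done()
            break
        results.append(f"processed {request}")  # stand-in for real work
        request_queue.task_done()

t = threading.Thread(target=worker)
t.start()

# A burst of five requests returns immediately from the caller's perspective;
# the queue buffers them for the worker.
for i in range(5):
    request_queue.put(i)
request_queue.put(None)

request_queue.join()
t.join()
print(results)
```

Parallel Pipelines is the same idea chained: several queue/worker stages in series, each stage sized independently for throughput.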
Overview: Developed by Facebook, Hive is a SQL-like framework (HiveQL) for data warehousing on top of MapReduce over HDFS. It converts a SQL query into a series of jobs for execution on a Hadoop cluster, and organizes HDFS data into tables by attaching structure to it. Schema on Read versus Schema on Write: Hive doesn't verify the data when it is loaded, but rather when a query is issued. Full-table scans are the norm, and a table update is achieved by transforming the data into a new table. HDFS does not provide in-place file …
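Schema-on-read can be shown in miniature: raw lines are copied into the store untouched at "load" time, and the schema (field splitting and types) is applied only when a query runs, so a malformed row surfaces at query time rather than load time. This is a plain-Python illustration of the concept, not Hive's actual API; the names (`load`, `query`, `raw_store`) are made up.

```python
# Schema-on-read sketch: loading is a blind copy; parsing happens at query time.
raw_store = []  # the "warehouse": files are dropped in, never validated

def load(lines):
    raw_store.extend(lines)            # no parsing, no verification -> fast load

def query():
    rows = []
    for line in raw_store:
        name, age = line.split(",")    # schema applied here, at read time
        rows.append((name, int(age)))  # a bad row would fail *now*, not at load
    return rows

load(["alice,34", "bob,29"])
print(query())
```

A schema-on-write system would invert this: `load` would parse and reject bad rows up front, making loads slower but queries cheaper.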
Big Data has a plethora of data file formats, and it's important to understand their strengths and weaknesses. Most explorers start out with some NoSQL-exported JSON data. However, specialized data structures are required, because putting each blob of binary data into its own file just doesn't scale across a distributed filesystem. TL;DR: choose Parquet! Row-oriented file formats (Sequence, Avro) - best when a large number of columns of a single row are needed for processing at the same time. General-purpose, …
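The row-versus-column trade-off can be sketched without any real file format: row-oriented storage keeps whole records together (good when a query needs every field of a row), while column-oriented storage (the Parquet approach) keeps each field contiguous (good when a query scans one column across many rows). A toy in-memory illustration; the records and field names are made up.

```python
# Row vs. column layout in miniature (illustrative, not a real file format).
records = [("alice", 34, "NL"), ("bob", 29, "US"), ("carol", 41, "DE")]

# Row-oriented: one entry per record, fields interleaved.
row_store = list(records)

# Column-oriented: one contiguous list per field.
col_store = {
    "name":    [r[0] for r in records],
    "age":     [r[1] for r in records],
    "country": [r[2] for r in records],
}

# Averaging one column touches only that column's data in the columnar
# layout, but must walk every full record in the row layout.
avg_age_columnar = sum(col_store["age"]) / len(col_store["age"])
avg_age_row = sum(r[1] for r in row_store) / len(row_store)
print(avg_age_columnar, avg_age_row)
```

On disk the same locality argument decides I/O: a columnar format reads only the bytes of the queried columns, which is why analytic scans favour Parquet while whole-record processing favours Sequence or Avro.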