Apache Spark is a fast and general engine for large-scale data processing (as in terabytes or larger data sets), and Flambo is a Clojure DSL for working with Spark. Stack Exchange is a network of question and answer websites covering a variety of topics (the most popular one being Stack Overflow). They periodically provide a Creative Commons licensed database dump, and we'll be using the dump for the Stack Exchange Gaming site as a toy dataset. First we will use Spark to convert the Stack Exchange files, which are provided as XML, into Apache Parquet format; later we will use it to run some queries to find out things like which users have the highest reputation, as well as which ones like to play Dwarf Fortress.

The code in this post is broken up and is pseudo-Clojure for demonstration purposes; the full code it is derived from is available on GitHub.

This post assumes you are using Leiningen, that you have some basic familiarity with either the Java or Scala Spark API, that Spark 1.3.1 is installed in ~/bin/spark, and that the March 2015 Gaming Stack Exchange Data Dump has been downloaded and extracted to ~/data/gaming-stackexchange. For simplicity, in these examples we will run everything in local mode.

Our project.clj looks like this:

```clojure
(defproject flambo-gaming-stack-exchange "0.1.0-SNAPSHOT"
  :description "Example of using Spark and Flambo"
  ...)
```

First we define a schema for the user data:

```clojure
(def user-schema
  (DataTypes/createStructType
    [(DataTypes/createStructField "id" DataTypes/IntegerType true (Metadata/empty))
     (DataTypes/createStructField "name" DataTypes/StringType true (Metadata/empty))
     (DataTypes/createStructField "reputation" DataTypes/IntegerType true (Metadata/empty))]))
```

Then we wire the pipeline together: build a SQL context, read Users.xml in as a text file, apply a Spark function that reads in a line of XML and potentially returns a Row, turn the result into a DataFrame, and save it as Parquet:

```clojure
(let [sql-ctx   (build-sql-context "ETL Users")
      ;; sc is the SparkContext; the XML-parsing function is elided here
      xml-users (f/text-file sc (str home "/data/gaming-stackexchange/Users.xml"))
      users     ...
      users-df  (createDataFrame sql-ctx users user-schema)]
  (saveAsParquetFile users-df
                     (str home "/data/gaming-stack-exchange-warehouse/users.parquet")))
```

This can be run like this:

```
$ ~/bin/spark/bin/spark-submit --class flambo_gaming_stack_exchange.etl_users \
    target/flambo-gaming-stack-exchange-0.1.0-SNAPSHOT-standalone.jar
```

You should now have a directory at ~/data/gaming-stack-exchange-warehouse/users.parquet containing the user data in Parquet format.

Loading posts is almost identical to loading users, so I'll omit most of the code here (it is available on GitHub). However, I will include the schema to make following along with the queries easier:

```clojure
(def post-schema
  (DataTypes/createStructType
    [(DataTypes/createStructField "ownerId" DataTypes/IntegerType true (Metadata/empty))
     (DataTypes/createStructField "postType" DataTypes/IntegerType true (Metadata/empty))
     (DataTypes/createStructField "tags" DataTypes/StringType true (Metadata/empty))]))
```

And it can be run like this:

```
$ ~/bin/spark/bin/spark-submit --class flambo_gaming_stack_exchange.etl_posts \
    target/flambo-gaming-stack-exchange-0.1.0-SNAPSHOT-standalone.jar
```

Like with the users step, you should now have a directory at ~/data/gaming-stack-exchange-warehouse/posts.parquet containing the post data in Parquet format.
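The post elides the function that turns a line of Users.xml into a Row, but the shape of the logic is easy to sketch. Below is an illustrative Python sketch (not the post's Clojure code) of how one `<row .../>` line maps onto the id/name/reputation schema; the attribute names `Id`, `DisplayName`, and `Reputation` are from the Stack Exchange dump format, and the sample line is made up.

```python
import xml.etree.ElementTree as ET

def parse_user_row(line):
    """Parse one <row .../> line from Users.xml into (id, name, reputation).

    Returns None for lines that are not data rows (e.g. the <users> wrapper
    element), mirroring the "potentially returns a Row" behaviour above.
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None
    elem = ET.fromstring(line)
    return (int(elem.get("Id")),
            elem.get("DisplayName"),
            int(elem.get("Reputation")))

print(parse_user_row('<row Id="2" DisplayName="alice" Reputation="101" />'))
# (2, 'alice', 101)
print(parse_user_row("<users>"))
# None
```

Each line of the dump file is a self-contained XML element, which is why a plain line-at-a-time `text-file` read works here at all.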
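One note on the `tags` column being a plain string: the Stack Exchange dump stores a post's tags as a single run of angle-bracketed names (e.g. `<dwarf-fortress><mods>`). A later query such as "which users play Dwarf Fortress" has to split that string apart; here is a minimal illustrative sketch of that split in Python (again, not code from the post).

```python
import re

def parse_tags(tags_field):
    """Split a Stack Exchange tags string like '<dwarf-fortress><mods>'
    into a list of tag names; None or empty input yields an empty list."""
    return re.findall(r"<([^>]+)>", tags_field or "")

print(parse_tags("<dwarf-fortress><mods>"))
# ['dwarf-fortress', 'mods']
```

Keeping the raw string in Parquet and splitting at query time keeps the ETL step simple.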