Many datasets are in the JSON Lines format, with one JSON object per line.
Just drop your dataset on an HDFS drive and RumbleDB can read it directly from there, and output right back to HDFS. Or on your screen in an interactive shell.
RumbleDB also works with many other formats (Text, Parquet, CSV, SVM, ROOT...) and can read from many other file systems (local, S3, Azure, ...), and the list is growing.
JSONiq is a declarative and functional language. It is user-friendly and easy to read and write, because it looks a lot like JSON.
JSONiq is as simple as SQL, but can do much more.
Often, datasets are not in first normal form and data can be nested at multiple levels. But SQL was not made for nested data.
Forget EXPLODE() calls in Spark SQL and dot projections. JSONiq was born to read and write nested data.
Often datasets are heterogeneous, especially when they are accumulated over years with evolving schemas.
When this happens, it does not fit in DataFrames and before RumbleDB, one had no choice but dealing with this manually.
If a field is not always associated with a value of the same type: no worries. JSONiq handles this seamlessly.
Under the hood, RumbleDB uses Apache Spark. Not all of us have clusters: You can use it to query JSON from your local drive. RumbleDB will spread the computations on your cores.
RumbleDB runs on large clusters of machines, with the data lying on any layer supported by Spark: HDFS, S3, ... We have tested RumbleDB with up to 64 machines, as well as collections of more than 20 billion objects (10+ TB), but it supports any sizes supported by Apache Spark.