Qubole has open-sourced Sparklens, a Spark profiler and performance prediction tool. We published blogs [1] and [2] on Sparklens previously, and this is an update about the new features and fixes in the latest release, 0.2.0. We have focused on ease of use while putting in place structures for later projects. We have also introduced more accurate methods for calculating resource wastage.
Ability to run Sparklens offline
Sparklens runs as part of the Spark application. During execution, Sparklens listens to Spark events via the SparkListener interface. When it receives the Application Finished event, Sparklens starts generating the performance and simulation report. But this reporting runs on the driver and can add a time penalty to the application, e.g. when running simulations with different numbers of executors. We have now decoupled Sparklens reporting from the Application Finished event. Sparklens now writes a small JSON file which contains all the data it needs to generate the performance and simulation report. Instead of looking for Sparklens output in log files, we can now generate this report at any time from these JSON files.
This Sparklens JSON file is generated for every Spark application run with Sparklens. Sparklens will make sure that this JSON data file remains compatible with later versions as well.
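To illustrate the mechanism described above, here is a minimal conceptual sketch in Scala (not Sparklens's actual implementation) of a SparkListener that collects data during the run and persists it as JSON when the application ends, so that the expensive report generation can happen later, offline. The class name, the collected fields, and the output path are made up for the example.

// Conceptual sketch only: a listener that defers report generation by
// dumping raw data to JSON instead of computing the report on the driver.
import org.apache.spark.scheduler._
import java.io.PrintWriter
import scala.collection.mutable.ArrayBuffer

class OfflineMetricsListener extends SparkListener {
  private val taskDurations = ArrayBuffer[Long]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Collect whatever per-task data the offline report will need.
    taskDurations += taskEnd.taskInfo.duration
  }

  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    // Instead of computing the report here (adding time to the app),
    // persist the raw data; a separate process reads it later.
    // Assumes the output directory already exists.
    val json = s"""{"taskDurations": [${taskDurations.mkString(",")}]}"""
    val out = new PrintWriter("/tmp/sparklens/app-data.json")
    try out.write(json) finally out.close()
  }
}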
Ability to run Sparklens using event history files
Sparklens now also supports reporting from Spark's event-history files, which are generated by default for all Spark applications. This gives you the ability to inspect any previously run application, including applications that were not run with Sparklens. However, Spark event-history files are typically very large, so while this approach is supported, it is not recommended: processing and reporting take more time with event-history files.
- Running from a Sparklens data file, e.g.:
./bin/spark-submit --packages qubole:sparklens:0.2.0-s_2.11 --class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg /tmp/sparklens/local-1533638931964.sparklens.json
- Running from an event-history file, e.g.:
./bin/spark-submit --packages qubole:sparklens:0.2.0-s_2.11 --class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg /tmp/spark-history/application_1520833877547_0285.lz4 source=history
For details, please refer to: https://github.com/qubole/sparklens#how-to-use-sparklens
Executor count for resources wasted
Among the many insights Sparklens provides is an analysis of efficiency: how much compute was available but went unutilized. For this, Sparklens depends on the number of executors available to the application. This approach turns out to be wrong when dynamic allocation of executors is enabled and the idle timeout is set to a low value (e.g. 1 minute on Qubole clusters). Executors may go away and come back some time later, as the jobs/stages of the application require. Sparklens was not accounting for executors that had gone away, and in some cases this produced wrong estimates of wasted resources (please see “earlier calculation of executors” in the graph). For example:
Total OneCoreComputeHours available                       8880h 02m
Total OneCoreComputeHours available (AutoScale Aware)      917h 41m
OneCoreComputeHours wasted by driver                       550h 21m
Here the total compute available is an inflated number because multiple executors came up and went down during this application's run, although the total number of nodes available might not have changed at all. For example, in the figure, at times T=7 and T=14 new executors have come up, mostly on the same machines where executors had previously gone down, and thus do not add to the overall cost.
We have now started using the “maximum concurrent executors” observed during the course of the application's runtime to estimate under-utilization. This gives us a correct estimate of how much resource went under-utilized in scenarios without auto-scaling of the cluster size, and a good estimate of the same with autoscaling and aggressive idle timeouts (see figure below).
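As an illustration, the sketch below (in Scala, with hypothetical input data; this is not Sparklens's internal code) shows one way to compute the maximum number of executors alive at the same time from executor add/remove timestamps, which is the quantity the new estimation is based on.

// Sketch: given the times at which executors were added and removed (e.g.
// from SparkListenerExecutorAdded/Removed events), find the peak number of
// executors that were alive concurrently.
def maxConcurrentExecutors(addTimes: Seq[Long], removeTimes: Seq[Long]): Int = {
  // Turn each add into +1 and each remove into -1, sweep in time order,
  // and track the running count; the peak is the maximum concurrency.
  val events = addTimes.map(t => (t, +1)) ++ removeTimes.map(t => (t, -1))
  var current = 0
  var max = 0
  for ((_, delta) <- events.sortBy { case (t, d) => (t, d) }) {
    current += delta
    max = math.max(max, current)
  }
  max
}

// Example: executors added at T=1,2,3,7,14 and removed at T=5,6 give a peak
// of 3 concurrent executors, even though 5 distinct executors existed overall.
// maxConcurrentExecutors(Seq(1, 2, 3, 7, 14), Seq(5, 6))  // == 3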
Worst case maximum memory used
Sparklens also reports the maximum memory (from the spark.memory.fraction pool) used by an executor over the course of the application's run. We previously reported this number per stage, but now we also report the maximum across all stages and jobs. If this number is close to, or higher than, the memory configured for executors, one should consider increasing executor memory.
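As a rough illustration of the comparison suggested above, the small Scala snippet below works through Spark's unified-memory sizing (usable pool = (executor heap - 300 MB reserved) * spark.memory.fraction, which defaults to 0.6). The concrete numbers are made up for the example and are not from a real application.

// Worked example: how big is the spark.memory.fraction pool that the
// worst-case maximum reported by Sparklens should be compared against?
val executorHeapMb  = 4096L   // spark.executor.memory = 4g (example value)
val reservedMb      = 300L    // Spark's fixed reserved memory
val memoryFraction  = 0.6     // spark.memory.fraction default
val usableUnifiedMb = ((executorHeapMb - reservedMb) * memoryFraction).toLong
// usableUnifiedMb is about 2277 MB here; if the worst-case maximum memory
// Sparklens reports is close to (or above) this, consider raising
// spark.executor.memory.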
We will have more features and updates for Sparklens in the coming weeks. Please stay tuned, and feel free to pitch in with ideas and issues at https://github.com/qubole/sparklens/issues