Titian: Data Provenance Support in Spark

Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, Tyson Condie
University of California, Los Angeles

ABSTRACT

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial and error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders-of-magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

1. INTRODUCTION

Data-Intensive...