Abstract
The Resource Description Framework (RDF) has become a popular graph-based standard, initially designed to represent information on the Web. Its flexibility has motivated its use in other domains, and today RDF datasets are large sources of information. Accordingly, research on scalable distributed and parallel RDF processing systems has gained momentum. Most of these systems apply partitioning algorithms that use the triple, the finest logical data structure in RDF, as the distribution unit. This purely physical strategy discards the graph structure of the model, which degrades performance. We believe that first gathering the triples that describe the same logical entity not only avoids scanning irrelevant data but also yields RDF partitions with an actual logical meaning. Moreover, this logical representation allows partitions to be defined in a declarative language, leaving implementation details aside. In this study, we give the formal definition of these logical entities, which we name graph fragments (\(\mathcal {G}f\)), and detail the algorithms that gather them for use as distribution units for RDF datasets. The proposed logical entities harmonize with the notions of partitioning by instances (horizontal) and by attributes (vertical) in the relational model. We propose allocation strategies for these fragments, including the case in which replication is available and both fragments by instances and fragments by attributes are used. We also discuss how to incorporate our declarative partitioning definition language into existing state-of-the-art systems. Our experiments on synthetic and real datasets show that graph fragments avoid data skew. In addition, we show that this type of data organization exhibits quantitative promise for certain types of queries. All of the above techniques are integrated into a single framework, which we call RDFPartSuite.