PGX Internship at Oracle Labs



The PGX team at Oracle Labs focuses on high-performance shared-memory and distributed graph processing and has open internship positions available.

Oracle, a global provider of enterprise cloud computing, is empowering businesses of all sizes on their journey of digital transformation. Oracle Cloud provides leading-edge capabilities in software as a service, platform as a service, infrastructure as a service, and data as a service.
Oracle’s application suites, platforms, and infrastructure leverage both the latest technologies and emerging ones – including artificial intelligence, machine learning, blockchain, and Internet of Things – in ways that create business differentiation and advantage for customers. Continued technological advances are always on the horizon.

Oracle Labs
Oracle Labs is the advanced research and development arm of Oracle. We focus on the development of technologies that keep Oracle at the forefront of the computer industry. Oracle Labs researchers look for novel approaches and methodologies, often taking on projects with high risk or uncertainty, or that are difficult to tackle within a product-development organization. Oracle Labs research is focused on real-world outcomes: our researchers aim to develop technologies that will someday play a significant role in the evolution of technology and society. For example, chip multithreading and the Java programming language grew out of work done in Oracle Labs.

Parallel Graph AnalytiX (PGX)
PGX is a toolkit for graph analytics that supports graph algorithms (such as PageRank), graph queries with PGQL (an SQL-like graph query language), and graph ML. PGX includes both a single-machine in-memory engine and a distributed engine for extremely large graphs. It is already available as an option in Oracle products and remains an active research project at Oracle Labs.

Internship Details
The goal of this project is to extend PGX, both the single-machine runtime (PGX.SM) and the distributed runtime (PGX.D), with new capabilities. We offer various topics depending on the skills and interests of the candidate (topics are not limited to the ones below; see also the "Related Topics" sub-section below):

  • Extended distributed computations
    PGX.D implements distributed PGQL queries using an asynchronous depth-first runtime. In this project, we will explore how to leverage, generalize, and possibly extend this asynchronous depth-first runtime to support a broader scope of computations.
  • Distributed fault tolerance & graph snapshots
    Fault tolerance in data-analytics systems often relies on techniques such as snapshots (storing the data in persistent storage) and replication. In this project, we will explore various options for enhancing distributed fault tolerance for PGX.D, including snapshots and replication.
  • Distributed data/graph placement
    In this project, we will design and evaluate various distributed data/graph placement and partitioning techniques in the presence of concurrent users.
  • Distributed scalable engine for graph-based ML
    Recent research shows that machine learning workloads can benefit from information encoded in the graph to achieve higher accuracy and faster convergence when learning models. In this project, we will explore how to efficiently retrieve embeddings for ML algorithms from distributed graphs, so that they can be processed in external ML frameworks.

  • Extension of an SQL-like graph query processing engine (PGQL)
    In this project, we will extend the semantics and implementation of the PGQL graph query language. Example topics include: (i) improving the composability of PGQL queries (i.e., starting a PGQL query from the results of a previous one, or from graph algorithm results) and optimizing the execution of such composed queries, and (ii) designing and implementing pipelined versions of the PGQL operators to reduce peak memory consumption during query execution.

  • Dynamic data loading for very large graphs
    Main memory is a limited resource. Consequently, in a data-analytics engine such as PGX.SM, only the most recent or most important data should be kept in memory, and other data can be offloaded to external storage or systems. During this internship, we will extend PGX.SM to support dynamically loading data that has been offloaded to external systems, in a graceful, efficient, and transparent manner.
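
To make the idea behind the dynamic data loading topic concrete, here is a minimal Java sketch of one common approach: keep the most recently used entries in memory and reload anything else on demand. This is only an illustration under stated assumptions, not how PGX.SM works. The class OffloadingCache, its methods, and the externalLoader function (a stand-in for reading from external storage) are hypothetical names introduced for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch (not PGX code): a bounded in-memory cache that evicts the
 * least-recently-used entry and transparently reloads missing entries on demand.
 */
public class OffloadingCache<K, V> {
    // Stand-in for a (slow) read from external storage; an assumption of this sketch.
    private final Function<K, V> externalLoader;
    private final Map<K, V> inMemory;
    private int externalLoads = 0;

    public OffloadingCache(int capacity, Function<K, V> externalLoader) {
        this.externalLoader = externalLoader;
        // accessOrder = true orders entries from least to most recently accessed.
        this.inMemory = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Drop the least-recently-used entry once capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    /** Returns the value for key, loading it from the external store on a miss. */
    public V get(K key) {
        V value = inMemory.get(key);
        if (value == null) {
            value = externalLoader.apply(key);
            externalLoads++;
            inMemory.put(key, value);
        }
        return value;
    }

    /** True if the key is currently held in memory (does not count as an access). */
    public boolean isResident(K key) {
        return inMemory.containsKey(key);
    }

    /** Number of times the external store had to be consulted. */
    public int externalLoads() {
        return externalLoads;
    }
}
```

Callers only ever use get, so they do not need to know whether a value was already resident or had to be fetched. A real engine would additionally have to handle concurrency, write-back of modified data, and eviction policies that account for data importance rather than recency alone.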

The successful candidate is expected to complete the internship using a wide and diverse set of skills.

Required Skills

  • Basic understanding of parallel, concurrent, and distributed programming (having completed relevant courses, such as Distributed / Concurrent Algorithms, is a plus);
  • For PGX.D: C/C++ programming skills. Experience with Java is a plus;
  • For PGX.SM: Java programming skills;
  • Ability to design and implement reliable and documented high-performance software, including tests;
  • Good problem-solving skills;
  • Experience with Linux (e.g., bash scripts);
  • Familiarity with graph algorithms is a plus.

For more information about the internship, please contact Vasileios Trigonakis or Damien Hilloulin.