Overlap-related operations

Single and pairwise PyRanges operations 

Pyranges offers many efficient methods to detect / process overlaps. We present them here split in two groups:

Methods accepting a pair of PyRanges as input:

`pyranges1.PyRanges.overlap`(other[, ...])	Return overlapping intervals.
`pyranges1.PyRanges.join_overlaps`(other, *[, ...])	Join PyRanges based on genomic overlap.
`pyranges1.PyRanges.set_intersect_overlaps`(other)	Return set-theoretical intersection.
`pyranges1.PyRanges.set_union_overlaps`(other)	Return set-theoretical union.
`pyranges1.PyRanges.intersect_overlaps`(other)	Return overlapping subintervals.
`pyranges1.PyRanges.subtract_overlaps`(other)	Subtract intervals, i.e. return non-overlapping subintervals.
`pyranges1.PyRanges.count_overlaps`(other[, ...])	Count number of overlaps per interval.
`pyranges1.PyRanges.nearest_ranges`(other[, ...])	Find closest interval.

Methods accepting a single PyRanges as input:

`pyranges1.PyRanges.cluster_overlaps`([...])	Give overlapping intervals a common id.
`pyranges1.PyRanges.merge_overlaps`([...])	Merge overlapping intervals into one.
`pyranges1.PyRanges.split_overlaps`([...])	Split into non-overlapping intervals.
`pyranges1.PyRanges.max_disjoint_overlaps`([...])	Find the maximal disjoint set of intervals.

Methods for pairs of PyRanges 

These methods accept two PyRanges objects as input (i.e. self and other), and return a new PyRanges object with the results of the operation.

We will briefly showcase them in this document. First, let’s create two PyRanges objects to work with:

>>> import pyranges1 as pr
>>> a = pr.PyRanges(dict(Chromosome="chr1",
...                      Start=[3, 13, 18, 23, 28, 32, 33],
...                      End=[6, 15, 21, 27, 29, 37, 36],
...                      Strand=["+", "+", "-", "-", "-", "+", "+"]))
>>> a
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        6  +
      1  |    chr1               13       15  +
      2  |    chr1               18       21  -
      3  |    chr1               23       27  -
      4  |    chr1               28       29  -
      5  |    chr1               32       37  +
      6  |    chr1               33       36  +
PyRanges with 7 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> b = pr.PyRanges(dict(Chromosome="chr1",
...                      Start=[6, 12, 19, 25, 34],
...                      End=[8, 14, 20, 29, 36],
...                      Strand=["+", "+", "+", "-", "+"]))
>>> b
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                6        8  +
      1  |    chr1               12       14  +
      2  |    chr1               19       20  +
      3  |    chr1               25       29  -
      4  |    chr1               34       36  +
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Filter rows: overlap 

The most intuitive overlap-related method is overlap. This is simply a filter of the intervals in self, so that only those overlapping any interval in other are returned:

>>> a.overlap(b)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      1  |    chr1               13       15  +
      3  |    chr1               23       27  -
      4  |    chr1               28       29  -
      5  |    chr1               32       37  +
      6  |    chr1               33       36  +
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Let’s now filter the rows of the other PyRanges object:

>>> b.overlap(a)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      1  |    chr1               12       14  +
      3  |    chr1               25       29  -
      4  |    chr1               34       36  +
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Note above how intervals overlapping with more than one interval in the other PyRanges are not reported multiple times. Use multiple=True to do so.

To get the self intervals without overlap in other, use invert=True:

>>> a.overlap(b, invert=True)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        6  +
      2  |    chr1               18       21  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

A reminder that intervals are encoded in pythonic convention: 0-based coordinates, with start included and end excluded. Thus, the closest possible intervals that are not overlapping are two intervals wherein the end of the first is equal of the start of the second. These are called “bookended” intervals, e.g.:

>>> a.head(1)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        6  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

>>> b.head(1)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                6        8  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Common arguments: slack, strand_behavior, match_by 

We will now use overlap to showcase arguments that are available in many overlap-related methods.

slack (default: 0) is used to relax the criteria of overlap. A value of 1 will report bookended intervals, previously not considered overlapping:

>>> b.overlap(a, slack=1)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                6        8  +
      1  |    chr1               12       14  +
      3  |    chr1               25       29  -
      4  |    chr1               34       36  +
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Analogously, higher values will report increasingly distant intervals. In practice, the self intervals are temporarily extended by the slack amount on both ends before the overlap operation.

strand_behavior determines how strand is treated. The value ‘same’ results in the intuitive behavior, i.e. two intervals overlap only if on the same strand:

>>> b.overlap(a, strand_behavior="same")  # the result here is the same as: a.overlap(b)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      1  |    chr1               12       14  +
      3  |    chr1               25       29  -
      4  |    chr1               34       36  +
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

On the other hand, value ‘ignore’ will define overlaps regardless of strands:

>>> a.overlap(b, strand_behavior="ignore")
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      1  |    chr1               13       15  +
      2  |    chr1               18       21  -
      3  |    chr1               23       27  -
      4  |    chr1               28       29  -
      5  |    chr1               32       37  +
      6  |    chr1               33       36  +
PyRanges with 6 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Value ‘opposite’ will require two intervals to be on the opposite strands to be considered overlapping:

>>> a.overlap(b, strand_behavior="opposite")
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      2  |    chr1               18       21  -
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Naturally, values ‘same’ and ‘opposite’ can only be used when the PyRanges objects have valid strand information, i.e. the Strand column is present and all its values are either ‘+’ or ‘-‘:

>>> b_unstranded = b.remove_strand()
>>> b_unstranded
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                6        8
      1  |    chr1               12       14
      2  |    chr1               19       20
      3  |    chr1               25       29
      4  |    chr1               34       36
PyRanges with 5 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

>>> a.overlap(b_unstranded, strand_behavior="same")
Traceback (most recent call last):
  ...
ValueError: Can only do same strand operations when both PyRanges contain valid strand info.

See function strand_valid for details, and make_strand_valid to convert non-standard strand values to standard ones.

The default value of strand_behavior is ‘auto’. This is transformed to ‘same’ if both PyRanges have valid strands, and to ‘ignore’ otherwise:

>>> a.overlap(b).equals(
... a.overlap(b, strand_behavior='same') )
True

>>> a.overlap(b_unstranded).equals(
... a.overlap(b, strand_behavior='ignore') )
True

Above, we leveraged method equals inherited from pandas Dataframe to compare table contents.

Note that the presence of any non-standard Strand value will result in strand being ignored for all rows. When leading to potentially non-intuitive behavior, a warning is printed:

>>> a_invalid = a.copy()
>>> a_invalid.loc[2, 'Strand'] = "."
>>> a_invalid
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        6  +
      1  |    chr1               13       15  +
      2  |    chr1               18       21  .
      3  |    chr1               23       27  -
      4  |    chr1               28       29  -
      5  |    chr1               32       37  +
      6  |    chr1               33       36  +
PyRanges with 7 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 3 strands (including non-genomic strands: .).

>>> a_invalid.overlap(b)  
<input>:1: UserWarning: overlap: 'auto' strand_behavior treated as ignore due to invalid Strand values. Please use strand_behavior=ignore
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      1  |    chr1               13       15  +
      2  |    chr1               18       21  .
      3  |    chr1               23       27  -
      4  |    chr1               28       29  -
      5  |    chr1               32       37  +
      6  |    chr1               33       36  +
PyRanges with 6 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 3 strands (including non-genomic strands: .).

Finally, argument match_by can be used to specify additional columns whose values must match for two intervals to be considered overlapping. For example, let’s add a column to both objects to mark intervals whose Start is an odd number, then use this column to filter the overlaps:

>>> a2 = a.assign(odd = lambda x:x.Start % 2 )
>>> a2
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1                3        6  +               1
      1  |    chr1               13       15  +               1
      2  |    chr1               18       21  -               0
      3  |    chr1               23       27  -               1
      4  |    chr1               28       29  -               0
      5  |    chr1               32       37  +               0
      6  |    chr1               33       36  +               1
PyRanges with 7 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> b2 = b.assign(odd = lambda x:x.Start % 2 )
>>> b2
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1                6        8  +               0
      1  |    chr1               12       14  +               0
      2  |    chr1               19       20  +               1
      3  |    chr1               25       29  -               1
      4  |    chr1               34       36  +               0
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> a2.overlap(b2, match_by='odd')
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      3  |    chr1               23       27  -               1
      5  |    chr1               32       37  +               0
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

The most versatile: join_overlaps 

The most versatile overlap-related method for pairs of PyRanges is join_overlaps. This method is analogous to a SQL join operation, but rather than matching rows in two tables through a common key, they are matched in virtue of their overlap.

This function searches for overlaps between the intervals in self and other, and reports in output the full information related to the input intervals. The returned PyRanges object will have a number of rows equal to the number of overlaps found, and the columns will be the union of the columns of self and other, using a suffix to differentiate columns in other which are present with the same name in the self PyRanges, like Start and End:

>>> a.join_overlaps(b)
  index  |    Chromosome      Start      End  Strand      Start_b    End_b
  int64  |    str             int64    int64  str           int64    int64
-------  ---  ------------  -------  -------  --------  ---------  -------
      1  |    chr1               13       15  +                12       14
      3  |    chr1               23       27  -                25       29
      4  |    chr1               28       29  -                25       29
      5  |    chr1               32       37  +                34       36
      6  |    chr1               33       36  +                34       36
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

In contrast to overlap, a row is returned per overlap, so if an interval in self overlaps with more than one interval in other, it will be reported multiple times, resulting in index duplicates (see below); you may call .reset_index() to correct this.

>>> b.join_overlaps(a)
  index  |    Chromosome      Start      End  Strand      Start_b    End_b
  int64  |    str             int64    int64  str           int64    int64
-------  ---  ------------  -------  -------  --------  ---------  -------
      1  |    chr1               12       14  +                13       15
      3  |    chr1               25       29  -                23       27
      3  |    chr1               25       29  -                28       29
      4  |    chr1               34       36  +                32       37
      4  |    chr1               34       36  +                33       36
PyRanges with 5 rows, 6 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.

Like all overlap-related methods accepting two PyRanges as input, join_overlaps accepts the strand_behavior argument. If Strand is not used to determine overlaps, it will be returned for both PyRanges:

>>> a.join_overlaps(b, strand_behavior="ignore")
  index  |    Chromosome      Start      End  Strand      Start_b    End_b  Strand_b
  int64  |    str             int64    int64  str           int64    int64  str
-------  ---  ------------  -------  -------  --------  ---------  -------  ----------
      1  |    chr1               13       15  +                12       14  +
      2  |    chr1               18       21  -                19       20  +
      3  |    chr1               23       27  -                25       29  -
      4  |    chr1               28       29  -                25       29  -
      5  |    chr1               32       37  +                34       36  +
      6  |    chr1               33       36  +                34       36  +
PyRanges with 6 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

slack and match_by arguments are also available, e.g.:

>>> a2.join_overlaps(b2, match_by='odd')
  index  |    Chromosome      Start      End  Strand        odd    Start_b    End_b
  int64  |    str             int64    int64  str         int64      int64    int64
-------  ---  ------------  -------  -------  --------  -------  ---------  -------
      3  |    chr1               23       27  -               1         25       29
      5  |    chr1               32       37  +               0         34       36
PyRanges with 2 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> a2.join_overlaps(b2, match_by='odd', slack=5)
  index  |    Chromosome      Start      End  Strand        odd    Start_b    End_b
  int64  |    str             int64    int64  str         int64      int64    int64
-------  ---  ------------  -------  -------  --------  -------  ---------  -------
      1  |    chr1               13       15  +               1         19       20
      3  |    chr1               23       27  -               1         25       29
      5  |    chr1               32       37  +               0         34       36
PyRanges with 3 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Pyranges provides method combine_interval_columns to post-process the output of join_overlaps and aggregate the coordinates of the overlapping intervals in Start and End columns. For example, this allows to obtain the union of the overlapping intervals:

>>> a2.join_overlaps(b2, match_by='odd', slack=5).combine_interval_columns('union')
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      1  |    chr1               13       20  +               1
      3  |    chr1               23       29  -               1
      5  |    chr1               32       37  +               0
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

join_overlaps is one of most versatile methods in PyRanges, since it retains the full information of the two input PyRanges objects. Nevertheless, more efficient alternative methods are available for specific use cases, detailed below.

Set operations: set_intersect_overlaps, set_union_overlaps 

Pyranges offers efficient methods based on the concept of set operations in mathematics. These are useful when the user is interested in the intervals themselves, rather than in the full information (i.e. metadata) of the input intervals.

Method set_intersect_overlaps, allows to obtain the genomic regions present in both PyRanges:

>>> a.set_intersect_overlaps(b)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1               13       14  +
      1  |    chr1               34       36  +
      2  |    chr1               25       27  -
      3  |    chr1               28       29  -
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> a.set_intersect_overlaps(b, strand_behavior="ignore")
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1               13       14
      1  |    chr1               19       20
      2  |    chr1               25       27
      3  |    chr1               28       29
      4  |    chr1               34       36
PyRanges with 5 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

The regions reported may be part of any interval in the two PyRanges. All metadata columns are dropped:

>>> a2.set_intersect_overlaps(b2).columns  # see above: a2 and b2 had the 'odd' column
Index(['Chromosome', 'Start', 'End', 'Strand'], dtype='str')

Analogously, method set_union_overlaps allows to obtain the genomic regions that are present in at least one of the PyRanges:

>>> a.set_union_overlaps(b)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        6  +
      1  |    chr1                6        8  +
      2  |    chr1               12       15  +
      3  |    chr1               19       20  +
      4  |    chr1               32       37  +
      5  |    chr1               18       21  -
      6  |    chr1               23       29  -
PyRanges with 7 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> a2.set_union_overlaps(b2, strand_behavior='ignore')
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                3        6
      1  |    chr1                6        8
      2  |    chr1               12       15
      3  |    chr1               18       21
      4  |    chr1               23       29
      5  |    chr1               32       37
PyRanges with 6 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

Interval manipulation operations: intersect_overlaps, subtract_overlaps 

Set operations do not preserve input metadata. join_overlaps preserve metadata of both PyRanges, but is less efficient. Pyranges also offers methods that preserve the metadata in self, but not in other. Specifically, method intersect_overlaps allows to obtain the intervals in self that overlap with any interval in other. It is similar to overlap, but here coordinates are modified to return only the actual overlaps:

>>> a2.intersect_overlaps(b)
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      1  |    chr1               13       14  +               1
      3  |    chr1               25       27  -               1
      4  |    chr1               28       29  -               0
      5  |    chr1               34       36  +               0
      6  |    chr1               34       36  +               1
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> a2.intersect_overlaps(b2, strand_behavior='ignore', match_by='odd')
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      3  |    chr1               25       27  -               1
      5  |    chr1               34       36  +               0
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Method subtract_overlaps allows to obtain the portions of intervals in self that do not overlap any interval in other:

>>> a2.subtract_overlaps(b).sort_values('Start')
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1                3        6  +               1
      1  |    chr1               14       15  +               1
      2  |    chr1               18       21  -               0
      3  |    chr1               23       25  -               1
      5  |    chr1               32       34  +               0
      6  |    chr1               33       34  +               1
      5  |    chr1               36       37  +               0
PyRanges with 7 rows, 5 columns, and 1 index columns (with 1 index duplicates).
Contains 1 chromosomes and 2 strands.

>>> a2.subtract_overlaps(b, strand_behavior='ignore').sort_values('Start')
  index  |    Chromosome      Start      End  Strand        odd
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1                3        6  +               1
      1  |    chr1               14       15  +               1
      2  |    chr1               18       19  -               0
      2  |    chr1               20       21  -               0
      3  |    chr1               23       25  -               1
      5  |    chr1               32       34  +               0
      6  |    chr1               33       34  +               1
      5  |    chr1               36       37  +               0
PyRanges with 8 rows, 5 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.

Fast counting: count_overlaps 

Method count_overlaps allows to count, for each interval in self, the number of intervals in other that overlaps with it. Input coordinates are not modified, and a new column is added:

>>> a2.count_overlaps(b) # using a2 to show the 'odd' column is preserved
  index  |    Chromosome      Start      End  Strand        odd     Count
  int64  |    str             int64    int64  str         int64    uint32
-------  ---  ------------  -------  -------  --------  -------  --------
      0  |    chr1                3        6  +               1         0
      1  |    chr1               13       15  +               1         1
      2  |    chr1               18       21  -               0         0
      3  |    chr1               23       27  -               1         1
      4  |    chr1               28       29  -               0         1
      5  |    chr1               32       37  +               0         1
      6  |    chr1               33       36  +               1         1
PyRanges with 7 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Arguments strand_behavior and match_by are available:

>>> a.count_overlaps(b, strand_behavior='ignore')
  index  |    Chromosome      Start      End  Strand       Count
  int64  |    str             int64    int64  str         uint32
-------  ---  ------------  -------  -------  --------  --------
      0  |    chr1                3        6  +                0
      1  |    chr1               13       15  +                1
      2  |    chr1               18       21  -                1
      3  |    chr1               23       27  -                1
      4  |    chr1               28       29  -                1
      5  |    chr1               32       37  +                1
      6  |    chr1               33       36  +                1
PyRanges with 7 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> a2.count_overlaps(b2, strand_behavior='ignore', match_by='odd')
  index  |    Chromosome      Start      End  Strand        odd     Count
  int64  |    str             int64    int64  str         int64    uint32
-------  ---  ------------  -------  -------  --------  -------  --------
      0  |    chr1                3        6  +               1         0
      1  |    chr1               13       15  +               1         0
      2  |    chr1               18       21  -               0         0
      3  |    chr1               23       27  -               1         1
      4  |    chr1               28       29  -               0         0
      5  |    chr1               32       37  +               0         1
      6  |    chr1               33       36  +               1         0
PyRanges with 7 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Advanced metrics (e.g. interval fraction covered by overlap) can be computed using the compute_interval_metrics function, which is run on the output of join_overlaps:

>>> a.join_overlaps(b, strand_behavior='ignore').compute_interval_metrics('fraction')
  index  |    Chromosome      Start      End  Strand      Start_b    End_b  Strand_b      fraction
  int64  |    str             int64    int64  str           int64    int64  str            float64
-------  ---  ------------  -------  -------  --------  ---------  -------  ----------  ----------
      1  |    chr1               13       15  +                12       14  +             0.5
      2  |    chr1               18       21  -                19       20  +             0.333333
      3  |    chr1               23       27  -                25       29  -             0.5
      4  |    chr1               28       29  -                25       29  -             1
      5  |    chr1               32       37  +                34       36  +             0.4
      6  |    chr1               33       36  +                34       36  +             0.666667
PyRanges with 6 rows, 8 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Find the closest interval: nearest_ranges 

Method nearest_ranges allows to find the closest interval in other for each interval in self:

>>> a.nearest_ranges(b)
  index  |    Chromosome      Start      End  Strand    Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    str             int64    int64  str       str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  --------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +         chr1                    6        8  +                    1
      1  |    chr1               13       15  +         chr1                   12       14  +                    0
      2  |    chr1               18       21  -         chr1                   25       29  -                    5
      3  |    chr1               23       27  -         chr1                   25       29  -                    0
      4  |    chr1               28       29  -         chr1                   25       29  -                    0
      5  |    chr1               32       37  +         chr1                   34       36  +                    0
      6  |    chr1               33       36  +         chr1                   34       36  +                    0
PyRanges with 7 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

The output format is similar to join_overlaps. Note the “Distance” column, which reports the distance between the intervals in self and other.

In case you want to find the nearest interval which does not overlap with each self interval, use exclude_overlaps=True:

>>> a.nearest_ranges(b, exclude_overlaps=True)
  index  |    Chromosome      Start      End  Strand    Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    str             int64    int64  str       str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  --------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +         chr1                    6        8  +                    1
      1  |    chr1               13       15  +         chr1                   19       20  +                    5
      2  |    chr1               18       21  -         chr1                   25       29  -                    5
      5  |    chr1               32       37  +         chr1                   19       20  +                   13
      6  |    chr1               33       36  +         chr1                   19       20  +                   14
PyRanges with 5 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

The nearest_ranges method also accepts the strand_behavior argument:

>>> a.nearest_ranges(b, strand_behavior='ignore', exclude_overlaps=True)
  index  |    Chromosome      Start      End  Strand    Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    str             int64    int64  str       str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  --------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +         chr1                    6        8  +                    1
      1  |    chr1               13       15  +         chr1                   19       20  +                    5
      2  |    chr1               18       21  -         chr1                   12       14  +                    5
      2  |    chr1               18       21  -         chr1                   25       29  -                    5
      3  |    chr1               23       27  -         chr1                   19       20  +                    4
      4  |    chr1               28       29  -         chr1                   34       36  +                    6
      5  |    chr1               32       37  +         chr1                   25       29  -                    4
      6  |    chr1               33       36  +         chr1                   25       29  -                    5
PyRanges with 8 rows, 9 columns, and 1 index columns (with 1 index duplicates).
Contains 1 chromosomes and 2 strands.

Methods for single PyRanges 

These overlap-related methods return a modified version of the input PyRanges object (self). We will showcase them with this data:

>>> c = pr.PyRanges(dict(Chromosome="chr1",
...                      Start=[1, 4, 10, 12, 19, 20, 24, 28],
...                      End=[5, 7, 14, 16, 27, 22, 25, 30],
...                      Strand=["+", "+", "+", "-", "+", "+", "+", "+"]))
>>> c
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  +
      1  |    chr1                4        7  +
      2  |    chr1               10       14  +
      3  |    chr1               12       16  -
      4  |    chr1               19       27  +
      5  |    chr1               20       22  +
      6  |    chr1               24       25  +
      7  |    chr1               28       30  +
PyRanges with 8 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Grouping overlapping intervals: cluster_overlaps 

The most flexible method in this category is cluster_overlaps. This function will detect overlaps among intervals in self, and assign a cluster identifier to each group of overlapping intervals. The object returned is identical to the input, with an additional column “Cluster” containing the cluster identifier:

>>> c.cluster_overlaps()
  index  |    Chromosome      Start      End  Strand      Cluster
  int64  |    str             int64    int64  str          uint32
-------  ---  ------------  -------  -------  --------  ---------
      0  |    chr1                1        5  +                 0
      1  |    chr1                4        7  +                 0
      2  |    chr1               10       14  +                 1
      4  |    chr1               19       27  +                 2
      5  |    chr1               20       22  +                 2
      6  |    chr1               24       25  +                 2
      7  |    chr1               28       30  +                 3
      3  |    chr1               12       16  -                 5
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Note that clusters 1 and 2 are kept separated only because of strand. We introduce argument use_strand, accepted by all overlap-related methods for single PyRanges. When set to False, strand is ignored for overlap detection:

>>> c.cluster_overlaps(use_strand=False)
  index  |    Chromosome      Start      End  Strand      Cluster
  int64  |    str             int64    int64  str          uint32
-------  ---  ------------  -------  -------  --------  ---------
      0  |    chr1                1        5  +                 0
      1  |    chr1                4        7  +                 0
      2  |    chr1               10       14  +                 1
      3  |    chr1               12       16  -                 1
      4  |    chr1               19       27  +                 2
      5  |    chr1               20       22  +                 2
      6  |    chr1               24       25  +                 2
      7  |    chr1               28       30  +                 3
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

The default value of argument use_strand is ‘auto’, which is interpreted as True if the PyRanges object has valid strand information, and False otherwise.

Note that cluster 2 contain intervals with indices 5 and 6 which do not directly overlap, but they both overlap with the interval with index 4. (More generally, a cluster is a connected component of the overlap graph, where two intervals are connected if they overlap directly.)

Argument cluster_column controls the name of the column containing the cluster identifier. Also, argument slack is also available here, and it is analogous to its use in methods for pairs of PyRanges. Its default value is 0. With slack=1, bookended intervals are placed in the same cluster. With slack=2, intervals that are distant at the most 1 bp are placed in the same cluster, like those with index 4 and 7 above; and so on.

>>> c2 = c.cluster_overlaps(slack=2, use_strand=False, cluster_column='myClust')
>>> c2
  index  |    Chromosome      Start      End  Strand      myClust
  int64  |    str             int64    int64  str          uint32
-------  ---  ------------  -------  -------  --------  ---------
      0  |    chr1                1        5  +                 0
      1  |    chr1                4        7  +                 0
      2  |    chr1               10       14  +                 1
      3  |    chr1               12       16  -                 1
      4  |    chr1               19       27  +                 2
      5  |    chr1               20       22  +                 2
      6  |    chr1               24       25  +                 2
      7  |    chr1               28       30  +                 2
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Argument match_by is also available. Only intervals with the same value in the specified column will be considered for overlap detection. Let’s add a gene column to the PyRanges object and compare “Cluster” results with the previous column “myClust”:

>>> c2['gene'] = ['abc'[s % 3] for s in c2.Start] # arbitrary gene assignment
>>> c2.cluster_overlaps(slack=2, use_strand=False, match_by='gene')
  index  |    Chromosome      Start      End  Strand      myClust  gene      Cluster
  int64  |    str             int64    int64  str          uint32  str        uint32
-------  ---  ------------  -------  -------  --------  ---------  ------  ---------
      3  |    chr1               12       16  -                 1  a               0
      6  |    chr1               24       25  +                 2  a               1
      0  |    chr1                1        5  +                 0  b               3
      1  |    chr1                4        7  +                 0  b               3
      2  |    chr1               10       14  +                 1  b               4
      4  |    chr1               19       27  +                 2  b               5
      7  |    chr1               28       30  +                 2  b               5
      5  |    chr1               20       22  +                 2  c               7
PyRanges with 8 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Resolve overlaps: merge_overlaps, split_overlaps, max_disjoint_overlaps 

Various methods exists to obtain a PyRanges object without internal overlaps.

Method merge_overlaps allows to merge overlapping intervals in self. In practice, this function returns the union of all intervals in self. All metadata columns are dropped:

>>> c.merge_overlaps()
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        7  +
      1  |    chr1               10       14  +
      2  |    chr1               19       27  +
      3  |    chr1               28       30  +
      4  |    chr1               12       16  -
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

As before, both slack and use_strand are supported:

>>> c.merge_overlaps(slack=2)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        7  +
      1  |    chr1               10       14  +
      2  |    chr1               19       30  +
      3  |    chr1               12       16  -
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Note that Strand is not reported when use_strand is set to False:

>>> c.merge_overlaps(slack=2, use_strand=False)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1        7
      1  |    chr1               10       16
      2  |    chr1               19       30
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

On the other hand, method split_overlaps allows to split intervals in self at the position of overlaps, leaving as many bookended intervals as necessary to avoid overlaps. This function drops metadata, too:

>>> pr.options.set_option('max_rows_to_show', 12) # to see all rows
>>> c.split_overlaps()
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        4  +
      1  |    chr1                4        5  +
      2  |    chr1                5        7  +
      3  |    chr1               10       14  +
      4  |    chr1               19       20  +
      5  |    chr1               20       22  +
      6  |    chr1               22       24  +
      7  |    chr1               24       25  +
      8  |    chr1               25       27  +
      9  |    chr1               28       30  +
     10  |    chr1               12       16  -
PyRanges with 11 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Function max_disjoint_overlaps also returns a set of non-overlapping intervals. In this case, however, input intervals are not modified, just filtered. The intervals to return are chosen to maximize the number of intervals in the output. The usual arguments (e.g. use_strand) are available:

>>> pr.options.reset_options()
>>> c2.max_disjoint_overlaps(use_strand=False) # using c2 to show that metadata is retained
  index  |    Chromosome      Start      End  Strand      myClust  gene
  int64  |    str             int64    int64  str          uint32  str
-------  ---  ------------  -------  -------  --------  ---------  ------
      0  |    chr1                1        5  +                 0  b
      2  |    chr1               10       14  +                 1  b
      4  |    chr1               19       27  +                 2  b
      7  |    chr1               28       30  +                 2  b
PyRanges with 4 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.