Operating on coordinates

[Figure: Operating on coordinates: cheatsheet]

Modifying coordinates 

The interval coordinate columns (Start, End) can be modified directly, like any Series in a pandas DataFrame. Let’s get some data:

>>> import pyranges1 as pr
>>> ex = pr.example_data.ensembl_gtf
>>> ex = ex[ex.Feature == "exon"].get_with_loc_columns('transcript_id')
>>> ex = ex.sort_ranges(use_strand=False).reset_index(drop=True)
>>> ex
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11868    12227  +           ENST00000456328
      1  |               1    12612    12721  +           ENST00000456328
      2  |               1    13220    14409  +           ENST00000456328
      3  |               1   110952   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      5  |               1   120873   120932  -           ENST00000610542
      6  |               1   129054   129223  -           ENST00000610542
      7  |               1   133373   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

We can modify a whole column at once:

>>> ex['Start'] += 5
>>> ex
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11873    12227  +           ENST00000456328
      1  |               1    12617    12721  +           ENST00000456328
      2  |               1    13225    14409  +           ENST00000456328
      3  |               1   110957   111357  -           ENST00000471248
      4  |               1   112704   112804  -           ENST00000471248
      5  |               1   120878   120932  -           ENST00000610542
      6  |               1   129059   129223  -           ENST00000610542
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Or we can modify a slice of the column:

>>> ex.loc[2:5, 'Start'] -= 5
>>> ex
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11873    12227  +           ENST00000456328
      1  |               1    12617    12721  +           ENST00000456328
      2  |               1    13220    14409  +           ENST00000456328
      3  |               1   110952   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      5  |               1   120873   120932  -           ENST00000610542
      6  |               1   129059   129223  -           ENST00000610542
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Or use a boolean index:

>>> ex.loc[ex.Strand == "+", "Start"] += 5
>>> e = ex.copy()
>>> e
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    12227  +           ENST00000456328
      1  |               1    12622    12721  +           ENST00000456328
      2  |               1    13225    14409  +           ENST00000456328
      3  |               1   110952   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      5  |               1   120873   120932  -           ENST00000610542
      6  |               1   129059   129223  -           ENST00000610542
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

On the other hand, pyranges1 offers convenient and intuitive methods to modify coordinates, which handle the complexity of intervals and strands for you. Next, we showcase some of these methods, using the e object above as a starting point.

Extending intervals 

The extend_ranges method extends the intervals in a PyRanges object. The ext parameter extends all intervals by the same amount in both directions:

>>> e.extend_ranges(ext=5)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11873    12232  +           ENST00000456328
      1  |               1    12617    12726  +           ENST00000456328
      2  |               1    13220    14414  +           ENST00000456328
      3  |               1   110947   111362  -           ENST00000471248
      4  |               1   112694   112809  -           ENST00000471248
      5  |               1   120868   120937  -           ENST00000610542
      6  |               1   129054   129228  -           ENST00000610542
      7  |               1   133373   133728  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

The ext_5 and ext_3 parameters set the extension at the 5’ and 3’ ends separately. These operations are strand-aware: a 5’ extension moves the Start of intervals on the positive strand and the End of intervals on the negative strand, and vice versa for 3’ extensions.

Let’s extend upstream by 10 bases:

>>> e.extend_ranges(ext_5=10)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11868    12227  +           ENST00000456328
      1  |               1    12612    12721  +           ENST00000456328
      2  |               1    13215    14409  +           ENST00000456328
      3  |               1   110952   111367  -           ENST00000471248
      4  |               1   112699   112814  -           ENST00000471248
      5  |               1   120873   120942  -           ENST00000610542
      6  |               1   129059   129233  -           ENST00000610542
      7  |               1   133378   133733  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Let’s extend by 12 bases on the 5’ end, and 6 bases on the 3’ end:

>>> e.extend_ranges(ext_5=12, ext_3=6)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11866    12233  +           ENST00000456328
      1  |               1    12610    12727  +           ENST00000456328
      2  |               1    13213    14415  +           ENST00000456328
      3  |               1   110946   111369  -           ENST00000471248
      4  |               1   112693   112816  -           ENST00000471248
      5  |               1   120867   120944  -           ENST00000610542
      6  |               1   129053   129235  -           ENST00000610542
      7  |               1   133372   133735  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
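The strand-aware arithmetic behind these results can be sketched in plain Python. This is only an illustration of the coordinate logic, not the pyranges1 implementation; the helper name extend_interval is hypothetical, and the sketch ignores chromosome boundaries:

```python
def extend_interval(start, end, strand, ext_5=0, ext_3=0, use_strand=True):
    """Return (start, end) after extending the 5' and 3' ends of one interval."""
    if use_strand and strand == "-":
        # On the negative strand the 5' end is End and the 3' end is Start.
        return start - ext_3, end + ext_5
    # Positive strand (or strand ignored): 5' end is Start, 3' end is End.
    return start - ext_5, end + ext_3


print(extend_interval(11878, 12227, "+", ext_5=12, ext_3=6))
print(extend_interval(110952, 111357, "-", ext_5=12, ext_3=6))
```

The two calls use rows 0 and 3 of e, so the printed coordinates can be compared with the corresponding rows of the table above.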

To ignore strand (i.e. treat all intervals as if on the positive strand), use use_strand=False:

>>> e.extend_ranges(ext_5=12, ext_3=6, use_strand=False)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11866    12233  +           ENST00000456328
      1  |               1    12610    12727  +           ENST00000456328
      2  |               1    13213    14415  +           ENST00000456328
      3  |               1   110940   111363  -           ENST00000471248
      4  |               1   112687   112810  -           ENST00000471248
      5  |               1   120861   120938  -           ENST00000610542
      6  |               1   129047   129229  -           ENST00000610542
      7  |               1   133366   133729  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

In all examples above, the extension is applied independently to all intervals in the PyRanges object. Alternatively, you can group intervals by a column, specified with the group_by argument. When provided, extensions are relative to the transcript, not the interval. In practice, only the first and/or last exons of each transcript may be extended:

>>> e.extend_ranges(ext_5=10, group_by='transcript_id')
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11868    12227  +           ENST00000456328
      1  |               1    12622    12721  +           ENST00000456328
      2  |               1    13225    14409  +           ENST00000456328
      3  |               1   110952   111357  -           ENST00000471248
      4  |               1   112699   112814  -           ENST00000471248
      5  |               1   120873   120932  -           ENST00000610542
      6  |               1   129059   129223  -           ENST00000610542
      7  |               1   133378   133733  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
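To make the grouped behaviour concrete, here is a minimal plain-Python sketch of a 5’ extension applied per transcript. It is not how pyranges1 implements group_by; the function name extend_transcripts_5prime and the tuple layout are assumptions made for illustration. For each group, only the exon that carries the transcript's 5’ end is changed:

```python
from collections import defaultdict


def extend_transcripts_5prime(exons, ext_5):
    """Extend only the 5'-most exon of each transcript.

    exons: list of (start, end, strand, transcript_id) tuples.
    Returns a new list in the same order as the input.
    """
    groups = defaultdict(list)
    for i, (_, _, _, tid) in enumerate(exons):
        groups[tid].append(i)

    out = [list(row) for row in exons]
    for idxs in groups.values():
        strand = exons[idxs[0]][2]
        if strand == "-":
            # 5' end of a negative-strand transcript: the largest End.
            i = max(idxs, key=lambda k: exons[k][1])
            out[i][1] += ext_5
        else:
            # 5' end of a positive-strand transcript: the smallest Start.
            i = min(idxs, key=lambda k: exons[k][0])
            out[i][0] -= ext_5
    return [tuple(row) for row in out]


exons = [
    (11878, 12227, "+", "ENST00000456328"),
    (12622, 12721, "+", "ENST00000456328"),
    (110952, 111357, "-", "ENST00000471248"),
    (112699, 112804, "-", "ENST00000471248"),
]
for row in extend_transcripts_5prime(exons, 10):
    print(row)
```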

Slicing operations 

Slicing operations cut the intervals in a PyRanges object to obtain smaller intervals. Intervals may be treated independently (default) or grouped into transcripts.

The slice_ranges method returns slices defined by a start and an end position, in Python notation (0-based, end excluded). So, to get the first 10 bases of each interval, we can do:

>>> e.slice_ranges(start=0, end=10)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    11888  +           ENST00000456328
      1  |               1    12622    12632  +           ENST00000456328
      2  |               1    13225    13235  +           ENST00000456328
      3  |               1   111347   111357  -           ENST00000471248
      4  |               1   112794   112804  -           ENST00000471248
      5  |               1   120922   120932  -           ENST00000610542
      6  |               1   129213   129223  -           ENST00000610542
      7  |               1   133713   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Note above that positions refer to the 5’ end of intervals, meaning that counting occurs from right to left for intervals on the negative strand. You can ignore strand using use_strand=False:

>>> e.slice_ranges(start=0, end=10, use_strand=False)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    11888  +           ENST00000456328
      1  |               1    12622    12632  +           ENST00000456328
      2  |               1    13225    13235  +           ENST00000456328
      3  |               1   110952   110962  -           ENST00000471248
      4  |               1   112699   112709  -           ENST00000471248
      5  |               1   120873   120883  -           ENST00000610542
      6  |               1   129059   129069  -           ENST00000610542
      7  |               1   133378   133388  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

start and end can be provided as positional arguments, and end can be omitted. When a requested slice is entirely out of bounds, the corresponding rows are absent from the output. The following yields intervals from position 200 to their existing 3’ end (i.e. we remove the first 200 bases of each interval). Note that intervals shorter than 200 bp have no row in the output:

>>> e.slice_ranges(200)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    12078    12227  +           ENST00000456328
      2  |               1    13425    14409  +           ENST00000456328
      3  |               1   110952   111157  -           ENST00000471248
      7  |               1   133378   133523  -           ENST00000610542
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Positions can be negative, in which case they are counted from the end of the interval. To get the last 10 bases of each interval, we can do:

>>> e.slice_ranges(-10)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    12217    12227  +           ENST00000456328
      1  |               1    12711    12721  +           ENST00000456328
      2  |               1    14399    14409  +           ENST00000456328
      3  |               1   110952   110962  -           ENST00000471248
      4  |               1   112699   112709  -           ENST00000471248
      5  |               1   120873   120883  -           ENST00000610542
      6  |               1   129059   129069  -           ENST00000610542
      7  |               1   133378   133388  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

This returns intervals without their first and last 3 bases:

>>> e.slice_ranges(3, -3)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11881    12224  +           ENST00000456328
      1  |               1    12625    12718  +           ENST00000456328
      2  |               1    13228    14406  +           ENST00000456328
      3  |               1   110955   111354  -           ENST00000471248
      4  |               1   112702   112801  -           ENST00000471248
      5  |               1   120876   120929  -           ENST00000610542
      6  |               1   129062   129220  -           ENST00000610542
      7  |               1   133381   133720  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
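The per-interval examples shown so far follow the same rules as Python list slicing, counted from the 5’ end. A minimal sketch of that mapping is given below; slice_interval is a hypothetical helper, not part of pyranges1, and edge cases (for example a negative offset larger than the interval) may be handled differently by the library:

```python
def slice_interval(start, end, strand, s, e=None, use_strand=True):
    """Slice one interval with Python-style offsets counted from its 5' end.

    Returns (start, end), or None when the requested slice is empty.
    """
    length = end - start
    # slice.indices resolves negative and out-of-range offsets like list slicing.
    lo, hi, _ = slice(s, e).indices(length)
    if lo >= hi:
        return None
    if use_strand and strand == "-":
        # Offsets run right to left on the negative strand.
        return end - hi, end - lo
    return start + lo, start + hi


print(slice_interval(110952, 111357, "-", 0, 10))  # first 10 bases, negative strand
print(slice_interval(12622, 12721, "+", 200))      # interval shorter than 200 bp
print(slice_interval(11878, 12227, "+", 3, -3))    # trim 3 bases at each end
```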

Above, each interval is treated independently. Alternatively, you can consider transcripts, grouping intervals (i.e. exons) by a column, specified with the group_by argument. When provided, slice_ranges arguments are relative to the transcript, not the interval. Note that using group_by assumes that exons belonging to the same transcript have no overlap; on the other hand, it does not assume presorting of intervals.

By default, coordinates are relative to spliced transcripts. For example, for a transcript with two exons of 50 bp, the first position of the second exon is considered to be 50 regardless of the length of the intron in-between.

Below we request the first 1500 bases of each spliced transcript. Only exons are counted to sum up to that length, and introns are ignored:

>>> e.slice_ranges(0, 1500, group_by='transcript_id')
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    12227  +           ENST00000456328
      1  |               1    12622    12721  +           ENST00000456328
      2  |               1    13225    14277  +           ENST00000456328
      3  |               1   110952   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      5  |               1   120873   120932  -           ENST00000610542
      6  |               1   129059   129223  -           ENST00000610542
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

In the e object, only ENST00000456328 is longer than 1500 bases. Compare it with the result above, noting that its third exon has been shortened:

>>> e
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    12227  +           ENST00000456328
      1  |               1    12622    12721  +           ENST00000456328
      2  |               1    13225    14409  +           ENST00000456328
      3  |               1   110952   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      5  |               1   120873   120932  -           ENST00000610542
      6  |               1   129059   129223  -           ENST00000610542
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

So, this will get the first and last 10 bases of each spliced transcript:

>>> first10 = e.slice_ranges(0, 10, group_by='transcript_id')
>>> last10 = e.slice_ranges(-10, group_by='transcript_id')
>>> pr.concat([first10, last10])
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    11888  +           ENST00000456328
      4  |               1   112794   112804  -           ENST00000471248
      7  |               1   133713   133723  -           ENST00000610542
      2  |               1    14399    14409  +           ENST00000456328
      3  |               1   110952   110962  -           ENST00000471248
      5  |               1   120873   120883  -           ENST00000610542
PyRanges with 6 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
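Spliced coordinates can be pictured as walking along the exons from the 5’ end while keeping a running offset that skips introns. The following plain-Python sketch illustrates this idea under the same assumption stated above (exons of a transcript do not overlap); slice_spliced is a hypothetical name and this is not the pyranges1 implementation:

```python
def slice_spliced(exons, strand, s, e=None):
    """Slice a transcript in spliced coordinates.

    exons: non-overlapping (start, end) tuples, in any order.
    Returns the surviving exon pieces in 5'-to-3' order.
    """
    minus = strand == "-"
    # Order exons from the 5' end: descending coordinates on the negative strand.
    ordered = sorted(exons, reverse=minus)
    total = sum(end - start for start, end in ordered)
    lo, hi, _ = slice(s, e).indices(total)

    pieces, offset = [], 0
    for start, end in ordered:
        length = end - start
        # Part of [lo, hi) that falls in this exon, in exon-local offsets.
        a = max(lo, offset) - offset
        b = min(hi, offset + length) - offset
        if a < b:
            pieces.append((end - b, end - a) if minus else (start + a, start + b))
        offset += length
    return pieces


t1 = [(11878, 12227), (12622, 12721), (13225, 14409)]  # ENST00000456328, '+'
t2 = [(110952, 111357), (112699, 112804)]              # ENST00000471248, '-'
print(slice_spliced(t1, "+", 0, 1500))
print(slice_spliced(t2, "-", 0, 10), slice_spliced(t2, "-", -10))
```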

Slicing operations can be combined with extensions to obtain intervals adjacent to the input ones. For example, this will obtain the 100 bases upstream of each transcript:

>>> e.extend_ranges(ext_5=100, group_by='transcript_id').slice_ranges(0, 100, group_by='transcript_id')
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11778    11878  +           ENST00000456328
      4  |               1   112804   112904  -           ENST00000471248
      7  |               1   133723   133823  -           ENST00000610542
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

This will obtain the 100 bases downstream of each transcript:

>>> e.extend_ranges(ext_3=100, group_by='transcript_id').slice_ranges(-100, group_by='transcript_id')
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      2  |               1    14409    14509  +           ENST00000456328
      3  |               1   110852   110952  -           ENST00000471248
      5  |               1   120773   120873  -           ENST00000610542
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

However, pyranges1 provides more convenient methods for this purpose: upstream and downstream return the regions upstream or downstream of intervals. You can specify their length, as well as an optional gap between the returned intervals and the input ones:

>>> e.downstream(100, group_by='transcript_id')
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      2  |               1    14409    14509  +           ENST00000456328
      3  |               1   110852   110952  -           ENST00000471248
      5  |               1   120773   120873  -           ENST00000610542
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> e.downstream(100, gap=10, group_by='transcript_id')
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      2  |               1    14419    14519  +           ENST00000456328
      3  |               1   110842   110942  -           ENST00000471248
      5  |               1   120763   120863  -           ENST00000610542
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
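The coordinates of such flanking regions follow from the strand, the length and the gap. Below is a minimal sketch of that arithmetic for a single interval (for grouped input, think of the outer boundaries of each transcript). The helper flank and its side argument are hypothetical and only illustrate the calculation:

```python
def flank(start, end, strand, length, gap=0, side="downstream"):
    """Return (start, end) of the region flanking one interval."""
    # Downstream of a '+' interval and upstream of a '-' interval both lie
    # at higher coordinates; the other two cases lie at lower coordinates.
    to_the_right = (side == "downstream") == (strand != "-")
    if to_the_right:
        return end + gap, end + gap + length
    return start - gap - length, start - gap


# Outer boundaries of ENST00000456328 (+) and ENST00000471248 (-)
print(flank(11878, 14409, "+", 100, gap=10))
print(flank(110952, 112804, "-", 100, gap=10))
print(flank(110952, 112804, "-", 100, side="upstream"))
```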

Sometimes, you may want to slice ranges according to non-spliced coordinates. This can be done with slice_ranges by setting the count_introns argument to True. So, the following will get the subintervals included in the first 1500 bases of each unspliced transcript:

>>> e.slice_ranges(0, 1500, group_by='transcript_id', count_introns=True)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    12227  +           ENST00000456328
      1  |               1    12622    12721  +           ENST00000456328
      2  |               1    13225    13378  +           ENST00000456328
      3  |               1   111304   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 6 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Thus, the command above is equivalent to requesting the portions of intervals that overlap with the first 1500 bases of the boundaries of each transcript:

>>> b = e.outer_ranges('transcript_id')
>>> b
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    14409  +           ENST00000456328
      1  |               1   110952   112804  -           ENST00000471248
      2  |               1   120873   133723  -           ENST00000610542
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> e.intersect_overlaps(b.slice_ranges(0, 1500))
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    12227  +           ENST00000456328
      1  |               1    12622    12721  +           ENST00000456328
      2  |               1    13225    13378  +           ENST00000456328
      3  |               1   111304   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 6 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
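The same equivalence can be spelled out for one transcript in plain Python: compute the transcript boundaries, slice them from the 5’ end, then clip each exon to the resulting window. This is a sketch for illustration only; slice_unspliced is a hypothetical helper, not a pyranges1 function:

```python
def slice_unspliced(exons, strand, s, e=None):
    """Slice a transcript in unspliced coordinates (introns are counted).

    exons: (start, end) tuples of one transcript, in any order.
    Returns the exon pieces inside the requested window, sorted by Start.
    """
    t_start = min(start for start, _ in exons)
    t_end = max(end for _, end in exons)
    lo, hi, _ = slice(s, e).indices(t_end - t_start)
    if lo >= hi:
        return []
    # Window on the genome, counted from the 5' end of the transcript.
    if strand == "-":
        w_start, w_end = t_end - hi, t_end - lo
    else:
        w_start, w_end = t_start + lo, t_start + hi

    pieces = []
    for start, end in sorted(exons):
        a, b = max(start, w_start), min(end, w_end)
        if a < b:  # keep only exons that overlap the window
            pieces.append((a, b))
    return pieces


t2 = [(110952, 111357), (112699, 112804)]  # ENST00000471248, '-'
print(slice_unspliced(t2, "-", 0, 1500))
```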

Interval complement 

Another useful operation is to obtain the complement of intervals in a PyRanges object, that is, all the bases that are not covered by any of the intervals. This can be done with the complement_ranges method. Let’s review our e object:

>>> e
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    12227  +           ENST00000456328
      1  |               1    12622    12721  +           ENST00000456328
      2  |               1    13225    14409  +           ENST00000456328
      3  |               1   110952   111357  -           ENST00000471248
      4  |               1   112699   112804  -           ENST00000471248
      5  |               1   120873   120932  -           ENST00000610542
      6  |               1   129059   129223  -           ENST00000610542
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

By default, the complement includes only the internal regions that are not covered by any interval, computed separately for each strand; it does not include the bases before the first interval or after the last one.

>>> e.complement_ranges()
  index  |      Chromosome    Start      End  Strand
  int64  |        category    int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |               1    12227    12622  +
      1  |               1    12721    13225  +
      2  |               1   111357   112699  -
      3  |               1   112804   120873  -
      4  |               1   120932   129059  -
      5  |               1   129223   133378  -
PyRanges with 6 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Setting use_strand=False ignores strand, returning the complement of all intervals together:

>>> e.complement_ranges(use_strand=False)
  index  |      Chromosome    Start      End
  int64  |        category    int64    int64
-------  ---  ------------  -------  -------
      0  |               1    12227    12622
      1  |               1    12721    13225
      2  |               1    14409   110952
      3  |               1   111357   112699
      4  |               1   112804   120873
      5  |               1   120932   129059
      6  |               1   129223   133378
PyRanges with 7 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

A possible application of the complement operation is to obtain the intergenic regions. To do that, we use the boundaries of each transcript group, i.e. object b obtained above:

>>> b.complement_ranges(use_strand=False)
  index  |      Chromosome    Start      End
  int64  |        category    int64    int64
-------  ---  ------------  -------  -------
      0  |               1    14409   110952
      1  |               1   112804   120873
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

Note that the regions before the first interval and after the last one are not included in the output. To include them, we set include_first_interval=True and provide the chromsizes argument, which is a dictionary with chromosome names as keys and their sizes as values:

>>> b.complement_ranges(use_strand=False, chromsizes={'1':249250621}, include_first_interval=True)
  index  |      Chromosome    Start        End
  int64  |        category    int64      int64
-------  ---  ------------  -------  ---------
      0  |               1        0      11878
      1  |               1    14409     110952
      2  |               1   112804     120873
      3  |               1   133723  249250621
PyRanges with 4 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
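Conceptually, a strand-ignoring complement on one chromosome is a sweep over the sorted intervals that records every uncovered gap. The sketch below illustrates this in plain Python. The function complement and its chrom_size parameter are hypothetical: here a single optional argument stands in for the combination of chromsizes and include_first_interval, which is a simplification rather than the library's interface:

```python
def complement(intervals, chrom_size=None):
    """Return the gaps between (start, end) intervals on one chromosome.

    If chrom_size is given, the regions before the first interval and
    after the last one are returned as well.
    """
    gaps = []
    # cursor = rightmost covered position so far (None until the first interval).
    cursor = 0 if chrom_size is not None else None
    for start, end in sorted(intervals):
        if cursor is not None and start > cursor:
            gaps.append((cursor, start))
        cursor = end if cursor is None else max(cursor, end)
    if chrom_size is not None and cursor < chrom_size:
        gaps.append((cursor, chrom_size))
    return gaps


bounds = [(11878, 14409), (110952, 112804), (120873, 133723)]  # object b
print(complement(bounds))
print(complement(bounds, chrom_size=249250621))
```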

Another useful application of the complement operation is to obtain the coordinates of introns in a transcript. To do this, the complement must be applied to each transcript independently, that is, grouped by a column. This can be done with the group_by argument:

>>> introns = e.complement_ranges(group_by='transcript_id')
>>> introns
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    12227    12622  +           ENST00000456328
      1  |               1    12721    13225  +           ENST00000456328
      2  |               1   111357   112699  -           ENST00000471248
      3  |               1   120932   129059  -           ENST00000610542
      4  |               1   129223   133378  -           ENST00000610542
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
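For intuition, when the exons of a transcript do not overlap, its introns are simply the gaps between consecutive exons once they are sorted. A minimal plain-Python sketch of this grouped complement follows; introns_by_transcript is a hypothetical helper and makes that non-overlap assumption explicitly:

```python
from collections import defaultdict


def introns_by_transcript(exons):
    """Map each transcript id to its introns.

    exons: iterable of (start, end, transcript_id) tuples.
    Assumes exons of the same transcript do not overlap.
    """
    groups = defaultdict(list)
    for start, end, tid in exons:
        groups[tid].append((start, end))

    introns = {}
    for tid, ex in groups.items():
        ex.sort()
        # Each intron spans from the End of one exon to the Start of the next.
        introns[tid] = [(a_end, b_start)
                        for (_, a_end), (b_start, _) in zip(ex, ex[1:])]
    return introns


exons = [
    (11878, 12227, "ENST00000456328"),
    (12622, 12721, "ENST00000456328"),
    (13225, 14409, "ENST00000456328"),
    (110952, 111357, "ENST00000471248"),
    (112699, 112804, "ENST00000471248"),
]
print(introns_by_transcript(exons))
```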

Other slicing operations 

Many genomic analyses involve running a sliding window over the genome or subregions of it. The window_ranges method returns adjacent windows of a specified size and step that span each interval in a PyRanges object.

>>> g = pr.PyRanges(dict(Chromosome=1, Start=[4, 60, 100], End=[11, 66, 107],
...                      Strand=['+', '+', '-'], Name=['a', 'a', 'b']))
>>> g
  index  |      Chromosome    Start      End  Strand    Name
  int64  |           int64    int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |               1        4       11  +         a
      1  |               1       60       66  +         a
      2  |               1      100      107  -         b
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

For example, let’s get windows of size 3:

>>> g.window_ranges(3)
  index  |      Chromosome    Start      End  Strand    Name
  int64  |           int64    int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |               1        4        7  +         a
      0  |               1        7       10  +         a
      0  |               1       10       11  +         a
      1  |               1       60       63  +         a
      1  |               1       63       66  +         a
      2  |               1      104      107  -         b
      2  |               1      101      104  -         b
      2  |               1      100      101  -         b
PyRanges with 8 rows, 5 columns, and 1 index columns (with 5 index duplicates).
Contains 1 chromosomes and 2 strands.

Windows are generated for each interval independently. Strand is considered: they are generated starting from the 5’ end. To ignore strand, use use_strand=False:

>>> g.window_ranges(3, use_strand=False)
  index  |      Chromosome    Start      End  Strand    Name
  int64  |           int64    int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |               1        4        7  +         a
      0  |               1        7       10  +         a
      0  |               1       10       11  +         a
      1  |               1       60       63  +         a
      1  |               1       63       66  +         a
      2  |               1      100      103  -         b
      2  |               1      103      106  -         b
      2  |               1      106      107  -         b
PyRanges with 8 rows, 5 columns, and 1 index columns (with 5 index duplicates).
Contains 1 chromosomes and 2 strands.
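
The strand-aware behaviour can be reproduced for a single interval in plain Python. This is a minimal sketch of the logic that matches the outputs shown above, not the pyranges1 implementation; the helper name split_into_windows is hypothetical and only handles adjacent windows (no step argument).

```python
def split_into_windows(start, end, size, strand="+", use_strand=True):
    """Split the half-open interval [start, end) into adjacent windows.

    Hypothetical helper for illustration only; it is not part of pyranges1.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    windows = []
    if use_strand and strand == "-":
        # Minus strand: the 5' end is the End coordinate, so walk leftwards.
        right = end
        while right > start:
            left = max(right - size, start)
            windows.append((left, right))
            right = left
    else:
        # Plus strand, or strand ignored: walk rightwards from Start.
        left = start
        while left < end:
            right = min(left + size, end)
            windows.append((left, right))
            left = right
    return windows


print(split_into_windows(100, 107, 3, strand="-"))
# [(104, 107), (101, 104), (100, 101)]
print(split_into_windows(100, 107, 3, strand="-", use_strand=False))
# [(100, 103), (103, 106), (106, 107)]
```

In both cases the partial window is the last one generated; only the direction of travel changes.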

To avoid duplicated indices, call the pandas DataFrame method reset_index on the output:

>>> g.window_ranges(3).reset_index(drop=True)
  index  |      Chromosome    Start      End  Strand    Name
  int64  |           int64    int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |               1        4        7  +         a
      1  |               1        7       10  +         a
      2  |               1       10       11  +         a
      3  |               1       60       63  +         a
      4  |               1       63       66  +         a
      5  |               1      104      107  -         b
      6  |               1      101      104  -         b
      7  |               1      100      101  -         b
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

You may retain the old index as a column with:

>>> g.window_ranges(3).reset_index(names='g_index')
  index  |      g_index    Chromosome    Start      End  Strand    Name
  int64  |        int64         int64    int64    int64  str       str
-------  ---  ---------  ------------  -------  -------  --------  ------
      0  |            0             1        4        7  +         a
      1  |            0             1        7       10  +         a
      2  |            0             1       10       11  +         a
      3  |            1             1       60       63  +         a
      4  |            1             1       63       66  +         a
      5  |            2             1      104      107  -         b
      6  |            2             1      101      104  -         b
      7  |            2             1      100      101  -         b
PyRanges with 8 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

To ‘window’ a whole genome (e.g. to then quantify reads in each window), pyranges1 offers pyranges1.tile_genome(). You must provide the chromosome sizes (various syntaxes are accepted) and a window size. The function returns windows that cover all chromosomes:

>>> cs = {'chr1': 323, 'chr2': 125}  # dictionary with chromosome sizes
>>> pr.tile_genome(cs, 100)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                0      100
      1  |    chr1              100      200
      2  |    chr1              200      300
      3  |    chr1              300      323
      4  |    chr2                0      100
      5  |    chr2              100      125
PyRanges with 6 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.

Note that the last window of each chromosome is not full, because the chromosome size is not a multiple of the window size. To make all tiles the same size, use the full_last_tile parameter; the last tile then extends past the end of the chromosome:

>>> pr.tile_genome(cs, 100, full_last_tile=True)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                0      100
      1  |    chr1              100      200
      2  |    chr1              200      300
      3  |    chr1              300      400
      4  |    chr2                0      100
      5  |    chr2              100      200
PyRanges with 6 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.
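
The tiling rule itself is simple enough to sketch in plain Python. The function below is a hypothetical illustration (tile_chromosomes is not a pyranges1 function) that assumes chromosome sizes are given as a dictionary and reproduces the coordinates shown in the two outputs above.

```python
def tile_chromosomes(chromsizes, tile_size, full_last_tile=False):
    """Return (chromosome, start, end) tiles covering every chromosome.

    Hypothetical helper for illustration only; it is not part of pyranges1.
    """
    tiles = []
    for chrom, length in chromsizes.items():
        # Tile starts are the multiples of tile_size below the chromosome length.
        for start in range(0, length, tile_size):
            end = start + tile_size
            if not full_last_tile:
                # Clip the last tile at the chromosome end.
                end = min(end, length)
            tiles.append((chrom, start, end))
    return tiles


cs = {"chr1": 323, "chr2": 125}
print(tile_chromosomes(cs, 100)[-1])
# ('chr2', 100, 125)
print(tile_chromosomes(cs, 100, full_last_tile=True)[-1])
# ('chr2', 100, 200)
```

Clipping is the only difference between the two modes: the number of tiles and their Start coordinates are identical.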

A related operation is tile_ranges, which returns only the genome tiles (of a defined size) that overlap the intervals in a PyRanges object:

>>> se = e.loc[[0,7],:]
>>> se
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11878    12227  +           ENST00000456328
      7  |               1   133378   133723  -           ENST00000610542
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

>>> se.tile_ranges(200)
  index  |      Chromosome    Start      End  Strand      transcript_id
  int64  |        category    int64    int64  category    str
-------  ---  ------------  -------  -------  ----------  ---------------
      0  |               1    11800    12000  +           ENST00000456328
      0  |               1    12000    12200  +           ENST00000456328
      0  |               1    12200    12400  +           ENST00000456328
      7  |               1   133200   133400  -           ENST00000610542
      7  |               1   133400   133600  -           ENST00000610542
      7  |               1   133600   133800  -           ENST00000610542
PyRanges with 6 rows, 5 columns, and 1 index columns (with 4 index duplicates).
Contains 1 chromosomes and 2 strands.

Note that, in contrast with window_ranges, the function tile_ranges returns intervals anchored to genome positions: their Start is always a multiple of the tile size, as with pyranges1.tile_genome(), regardless of the strand of the original intervals. The overlap_column argument adds a column reporting the fraction of each returned tile that is covered by the original interval:

>>> se.tile_ranges(200, overlap_column='nts')
  index  |      Chromosome    Start      End  Strand      transcript_id          nts
  int64  |        category    int64    int64  category    str                float64
-------  ---  ------------  -------  -------  ----------  ---------------  ---------
      0  |               1    11800    12000  +           ENST00000456328      0.61
      0  |               1    12000    12200  +           ENST00000456328      1
      0  |               1    12200    12400  +           ENST00000456328      0.135
      7  |               1   133200   133400  -           ENST00000610542      0.11
      7  |               1   133400   133600  -           ENST00000610542      1
      7  |               1   133600   133800  -           ENST00000610542      0.615
PyRanges with 6 rows, 6 columns, and 1 index columns (with 4 index duplicates).
Contains 1 chromosomes and 2 strands.
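
The values in the nts column above are consistent with the overlap length divided by the tile size (for the first row, 12000 - 11878 = 122 positions out of 200, i.e. 0.61). The sketch below reproduces that arithmetic for one interval in plain Python; tiles_overlapping is a hypothetical helper, not the pyranges1 implementation.

```python
def tiles_overlapping(start, end, tile_size):
    """Return (tile_start, tile_end, fraction) for tiles overlapping [start, end).

    Tiles are anchored to multiples of tile_size; fraction is the share of the
    tile covered by the interval. Hypothetical helper, not part of pyranges1.
    """
    # Round Start down to the closest multiple of the tile size.
    first_tile_start = (start // tile_size) * tile_size
    result = []
    for tile_start in range(first_tile_start, end, tile_size):
        tile_end = tile_start + tile_size
        overlap = min(end, tile_end) - max(start, tile_start)
        result.append((tile_start, tile_end, overlap / tile_size))
    return result


for tile in tiles_overlapping(11878, 12227, 200):
    print(tile)
# (11800, 12000, 0.61)
# (12000, 12200, 1.0)
# (12200, 12400, 0.135)
```

Strand plays no role in this computation, which is why the tiles of the minus-strand interval are also listed in ascending order.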
