slothy.core.core

Module Contents

Classes

Result

The results of a one-shot SLOTHY optimization run

SlothyBase

Stateless core of SLOTHY.

API

exception slothy.core.core.SlothyException

Bases: Exception

Generic exception thrown by SLOTHY

Initialization

Initialize self. See help(type(self)) for accurate signature.

exception slothy.core.core.SlothySelfCheckException

Bases: slothy.core.core.SlothyException

Exception thrown by SLOTHY during tht selfcheck

Initialization

Initialize self. See help(type(self)) for accurate signature.

class slothy.core.core.Result(config)

Bases: slothy.helper.LockAttributes

The results of a one-shot SLOTHY optimization run

Initialization

property orig_code

Optimization input: Source code

_gen_orig_code_visualized_perf()
_gen_orig_code_visualized_perm()
_gen_orig_code_visualized()
property cycles

The number of cycles that SLOTHY thinks the code will take.

If software pipelining is enabled, this is the expected average cycle count per iteration.

property cycles_bound

A lower bound for the number of cycles obtained during optimization.

This may be lower than the estimated cycle count of the result itself if optimization terminated prematurely, e.g. because of a timeout.

property ipc_bound

An upper bound on the instruction/cycle (IPC) count obtained during optimization.

This may be lower than the IPC value of the result itself if optimization terminated prematurely, e.g. because of a timeout.

property optimization_wall_time

Returns the amount of wall clock time in seconds the optimization has taken

property optimization_user_time

Returns the amount of CPU time in seconds the optimization has taken

property ipc

The instruction/cycle (IPC) count that SLOTHY thinks the code will have.

property orig_code_visualized

Optimization input: Source code, including visualization of reordering

property orig_inputs

The list of input registers in the _original_ source code.

property orig_outputs

The list of output registers in the _original_ source code.

property codesize

The number of instructions in the (original and optimized) source code.

property codesize_with_bubbles

Performance-measure for the optimized source code.

This is the number of issue slots used by the optimized code. Equivalently, after division by the target’s issue width, it is SLOTHY’s expectation of the performance of the code in cycles.

It is also the codomain of the xxx_with_bubbles dictionaries.

property pre_core_post_dict

Dictionary indicating interleaving of iterations.

This dictionary consists of items (i, (pre, core, post)), where i is the original program order position of an instruction, and pre, core, post indicate whether that instruction is an early, core or late instruction in the optimized source code.

An early instruction is one which is pulled into the previous iteration. A late instruction is one which is deferred until the next iteration. A core instruction is one which is left in its original iteration.

This property is only meaningful when software pipelining is enabled.

See also is_pre, is_core, is_post.

is_pre(i, original_program_order=True)

Indicates if the instruction in original program order position i (starting at 0) was marked ‘early’ and thereby pulled into the previous iteration.

If original_program_order is False, the index instead refers to the _new_ program order position with in the kernel of the optimized loop.

This only makes sense when software pipelining was enabled.

is_core(i, original_program_order=True)

Indicates if the instruction in original program order position i (starting at 0) was neither marked ‘early’ nor ‘late’, so stayed in its original iteration.

If original_program_order is False, the index instead refers to the _new_ program order position with in the kernel of the optimized loop.

This only makes sense when software pipelining was enabled.

is_post(i, original_program_order=True)

Indicates if the instruction in original program order position i (starting at 0) was marked ‘late’ and thereby pulled into the next iteration.

This only makes sense when software pipelining was enabled.

num_pre()

In a software pipelining result, the number of ‘early’ instructions.

num_core()

In a software pipelining result, the number of ‘late’ instructions.

num_post()

In a software pipelining result, the number of ‘core’ instructions (neither early nor late).

num_prepost()

In a software pipelining result, the number of early or late instructions. This can be seen as a measure for the amount of interleaving across iterations that has happened.

num_exceptional_iterations()

The number of loop iterations jointly covered by the loop preamble and postamble. In other words, the amount by which the iteration count for the optimized loop kernel is lower than the original iteration count.

property reordering_with_bubbles

The reordering map linking original and optimized source code.

The output ordering includes ‘bubbles’ reflecting where SLOTHY thinks that the target microarchitecture would stall.

This is a map from [0,…,codesize) to [0,…,codesize_with_bubbles).

cycle_position_with_bubbles()

Maps the original program order position of an instruction to the cycle number in which SLOTHY thinks (according to its microarchitecture model) the instruction would execute.

reordering_with_bubbles_inv()

The inverse reordering permutation linking optimized and original source code

get_reordering_with_bubbles(copies)

The reordering permutation linking original and optimized source code after unrolling copies times. The output ordering includes ‘bubbles’ reflecting where SLOTHY thinks that the target microarchitecture would stall.

get_periodic_reordering_with_bubbles(copies)
get_periodic_reordering_with_bubbles_inv(copies)

The inverse permutation of get_periodic_reordering_with_bubbles()

get_periodic_reordering(copies)
get_periodic_reordering_inv(copies)
get_reordering(copies, no_gaps)
get_fully_unrolled_loop(iterations)
get_unrolled_kernel(iterations)
reordering()

The reordering permutation linking original and optimized source code

periodic_reordering_with_bubbles()
periodic_reordering_with_bubbles_inv()

The inverse dictionary to periodic_reordering_with_bubbles

periodic_reordering()
periodic_reordering_inv()

The inverse permutation to periodic_reordering

reordering_inv()

The inverse reordering permutation linking optimized and original source code

property fixlen
property code_raw

Optimized code, without annotations

_get_code(visualize_reordering)
property code

The optimized source code

_get_full_code(log)
selfcheck(log)

Checks that the original and optimized source code have isomorphic DFGs. More specifically, that the reordering permutation stored in Result object yields an isomorphism between DFGs.

When software pipelining is used, this is a bounded check for a fixed number of iterations.

selftest(log)

Run empirical self test, if it exists for the target architecture

selfcheck_with_fixup(log)

Do selfcheck, and consider preamble/postamble fixup in case of SW pipelining

In the presence of cross iteration dependencies, the preamble and postamble may be functionally incorrect and need fixup.

_selfcheck_core(log)
property inputs

The list of input registers in the _optimized_ source code. This is a list of architectural registers, and relates to the inputs of the original source code via Result.input_renaming. For the list of original input register names, use Result.orig_inputs.

property outputs

The list of output registers in the _optimized_ source code. This is a list of architectural registers, and relates to the outputs of the original source code via Result.output_renaming. For the list of original output register names, use Result.orig_outputs.

property input_renamings

Dictionary mapping original input names to architectural register names used in the optimized source code. See also Config.rename_inputs.

property output_renamings

Dictionary mapping original output names to architectural register names used in the optimized source code. See also Config.rename_outputs.

property stalls

The number of stalls in the optimization result.

More precisely: The number of cycles c such that optimization succeeded with up to c * issue_width unused issue slots.

_build_stalls_idxs()
property stall_positions

The positions of instructions in the optimized assembly where SLOTHY expects a stall or unused issue slot.

property kernel

When using software pipelining, the loop kernel of the optimized loop.

property kernel_input_output

When using software pipelining, the dependencies between successive loop iterations.

This is useful if you want to further optimize the preamble (and perhaps some code preceeding it), because the kernel dependencies are the output of the preamble.

property preamble

When using software pipelining, the preamble of the optimized loop.

property postamble

When using software pipelining, the postamble of the optimized loop.

property config

The configuration that was used for the optimization.

property success

Whether the optimization was successful

__bool__()
property valid

Indicates whether the result object is valid.

_require_sw_pipelining()
static _fixup_reordered_pair(t0, t1, logger)
static _fixup_reset(nodes)
static _fixup_finish(nodes, logger)
_offset_fixup_sw(log)
_offset_fixup_straightline(log)
offset_fixup(log)

Fixup address offsets after optimization

fixup_preamble_postamble(log)

Potentially fix up the preamble and postamble

When software pipelining is used in the context of a loop with cross-iteration dependencies, the core optimization step might lead to functionally incorrect preamble and postamble. This function checks if this is the case and fixes preamble and postamble, if necessary.

class slothy.core.core.SlothyBase(Arch: any, Target: any, *, logger: any = None, config: any = None)

Bases: slothy.helper.LockAttributes

Stateless core of SLOTHY.

This class is the technical heart of the package: It implements the conversion of a software optimization problem into a constraint solving problem which can then be passed to an external constraint solver. We use Google OR-Tools.

SlothyBase is agnostic of the target architecture and microarchitecture, which are specified at construction time.

Parameters:
  • Arch (any) – A model of the underlying architecture.

  • Target (any) – A model of the underlying microarchitecture.

  • logger (any) – The logger to be used. If omitted, a child of the root logger will be used.

  • config (any) – The configuration to use. If omitted, the default configuration will be used.

Initialization

property arch

The underlying architecture used by SLOTHY, as a read-only reference to the corresponding field in the configuration.

property target

The underlying microarchitecture used by SLOTHY, as a read-only reference to the corresponding field in the configuration.

property result

The result object of the last optimization.

property success

Indicates whether the last optimiation succeeded.

_reset()
_set_timeout(timeout)
optimize(source, prefix_len=0, suffix_len=0, log_model=None, retry=False)
_load_source(source, prefix_len=0, suffix_len=0)
_mark_loop_siblings()

When using SW pipelining, we internally use two loop iterations. Add references between corresponding instructions in both iterations.

_init_model_internals()
_usage_check()
_reg_is_architectural(reg, ty)
_restrict_input_output_renaming()
_set_avail_renaming_registers()
_dump_avail_renaming_registers()
_add_register_usage(t, reg, reg_ty, var, start_var, dur_var, end_var, spill_var)
_backup_original_code()
class CpSatSolutionCb(logger, objective_description, max_solutions=32, is_good_enough=None, printer=None, variables=None)

Bases: ortools.sat.python.cp_model.CpSolverSolutionCallback

A solution callback class represents objects that are alive during CP-SAT operation and equipped with a callback that is triggered every time CP-SAT finds a new solution.

This callback counts the solutions found so far, and aborts the search when the solution is sufficiently close to the optimum.

Initialization

on_solution_callback()

Triggered when OR-Tools finds a solution to the current constraint problem

solution_count()

The number of solutions found so far

fixup_preamble_postamble()

Potentially fix up the preamble and postamble

When software pipelining is used in the context of a loop with cross-iteration dependencies, the core optimization step might lead to functionally incorrect preamble and postamble. This function checks if this is the case and fixes preamble and postamble, if necessary.

_extract_result()
_extract_positions(get_value)
_extract_input_output_renaming()
_extract_register_renamings(get_value)
_extract_kernel_input_output()
_extract_spills()
_extract_code()
_add_path_constraint(consumer, producer, cb)

Add model constraint cb() relating to the pair of producer-consumer instructions Outside of loop mode, this ignores producer and consumer, and just adds cb(). In loop mode, however, the condition has to be omitted in two cases:

  • The producer belongs to the early part of the first iteration, but the consumer doesn’t.

    In this case, the early part of the first iteration is actually the early part of the third iteration.

  • The consumer belongs to the late part of the second iteration, but the producer doesn’t.

_add_path_constraint_from(consumer, producer, cb_lst)
_get_nodes_by_program_order(low=False, high=False, allnodes=False, inputs=False, outputs=False)
_get_nodes_by_depth(**kwargs)
_get_nodes(by_depth=False, **kwargs)
_add_variables_scheduling()

Create variables for start, end and duration of every instruction, and assign the duration intervals to the units that run the instructions

_add_variables_functional_units()
_add_variables_dependencies()
_add_variables_register_renaming()

Add boolean variables indicating if an instruction uses a certain output register

_add_variables_loop_rolling()
_is_low(t)
_is_high(t)
_is_input(t)
_is_output(t)
_iter_dependencies(with_virt=True, with_duals=True)
_iter_dependencies_with_lifetime()
_iter_cross_iteration_dependencies()
_has_cross_iteration_dependencies()
_add_constraints_lifetime_bounds_single(t)
_add_constraints_lifetime_bounds()
_force_allocation_variant(alloc_dict, combinations, combination_vars)
_forbid_renaming_collision_single(var_dic_a, var_dic_b, condition=None)
_forbid_renaming_collision_many(idx_pairs, var_dic_a, var_dic_b)
_force_renaming_collision(var_dic_a, var_dic_b)
_force_allocation_restriction_single(valid_allocs, var_dict)
_force_allocation_restriction_many(restriction_lst, var_dict_lst)
_add_constraints_register_renaming()
_add_constraints_loop_optimization()
_add_constraints_n_issue()
_add_constraints_locked_ordering()
_add_constraints_scheduling()
_add_constraints_dependency_order()
_add_constraints_latencies()
_add_constraints_register_usage()
_add_constraints_misc()
get_inst_pairs(cond_fst=None, cond_snd=None, cond=None)

Yields all instruction pairs satisfying the provided predicate.

_add_constraints_functional_units()
_add_constraints_loop_periodic()
restrict_early_late_instructions(filter_func)

Forces all instructions not passing the filter_func to be core, that is, neither early nor late instructions.

This is only meaningful if software pipelining is enabled.

force_early(filter_func, early=True)

Forces all instructions passing the filter_func to be early, that is, neither early nor late instructions.

This is only meaningful if software pipelining is enabled.

restrict_slots_for_instruction(t, slots)

This forces the given instruction to be assigned precisely one of the slots in the provided slot list.

This is only useful for microarchitecture models with multi-issuing (otherwise, there’s only slot 0 anyway).

For microarchitectures where functional units have a throughput of 1 instruction per cycle, this option can be an alternative way to model the occupancy of functional units: Rather than assigning length-1 intervals for the occupance of functional units and demanding that they’re disjoint – the default operation of Slothy, which works for multi-cycle instructions – one can identify the slot number with the functional unit that an instruction runs on, and then the non-overlapping constraint is subsumed by the mutual distinctiveness of the program order positions. Note, though, that this approach ‘overwrites’ the actual microarchitectural meaning of the slot number – if there are uarch constraints regarding the slot number (e.g. “instruction XYZ can only be issued on slot 0”) then it should not be used.

restrict_slots_for_instructions(ts, slots)

Restrict issue slots for a list of instructions

filter_instructions_by_property(filter_func)

Returns all nodes for instructions passing the filter

filter_instructions_by_class(cls_lst)

Returns all nodes for instructions belonging the provided list of instruction classes.

restrict_slots_for_instructions_by_class(cls_lst: list, slots: list)

Restrict issue slots for all instructions belonging to the provided list of instruction classes.

Parameters:
  • cls_lst (list) – A list of instruction classes

  • slots (list) – A list of issue slots represented as integers.

restrict_slots_for_instructions_by_property(filter_func: any, slots: list)

Restrict issue slots for all instructions passing the given filter function.

Parameters:
  • filter_func (any) – A predicate on instructions

  • slots (list) – A list of issue slots represented as integers.

_stalls_to_stats(stalls)
_print_stalls(stalls)
_print_stalls_and_spills(bound, variables)
_add_objective(force_objective=False)
_describe_solver()
_init_external_model_and_solver()
_NewIntVar(minval, maxval, name='')
_NewIntervalVar(base, dur, end, name='')
_NewOptionalIntervalVar(base, dur, end, cond, name='')
_NewBoolVar(name='')
_NewConstant(val)
_Add(c)
_AddNoOverlap(lst)
_AddExactlyOne(lst)
_AddMultiplicationEquality(lst, res)
_AddImplication(a, b)
_AddAtLeastOne(lst)
_AddAbsEq(dst, expr)
_AddAllDifferent(lst)
_AddHint(var, val)
_AddMaxEquality(varlist, var)
_export_model()
_solve()
retry(fix_stalls=None)
_dump_model_statistics()