Table of Contents

# -*- mode: Org; fill-column: 110; coding: utf-8 -*-
#+TITLE: Python my notes

TODO: from os import environ as env; env.get('MYSQL_PASSWORD')

1. most common structures

1.1. sliced windows

from itertools import islice

def window(seq, n=2):
    """Return a sliding window (of width n) over data from the iterable.
    s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., s[n]), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

# or
seq = [0, 1, 2, 3, 4, 5]
window_size = 3

for i in range(len(seq) - window_size + 1):
    print(seq[i: i + window_size])
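A quick, self-contained check of the generator recipe above on small inputs:

```python
from itertools import islice

def window(seq, n=2):
    """Sliding window of width n over seq (itertools recipe)."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

print(list(window([0, 1, 2, 3], 3)))  # [(0, 1, 2), (1, 2, 3)]
print(list(window("abc")))            # [('a', 'b'), ('b', 'c')]
```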

1.2. compare row to itself

import numpy as np
a = [0,1,2,3,4,5,6,7,8,9]

r = np.zeros((len(a),len(a)))
for x in a:
    for y in a:
        if y<x:
            continue # we skip y!
        r[x,y] = x+y

print(r)
[[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
 [ 0.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 0.  0.  4.  5.  6.  7.  8.  9. 10. 11.]
 [ 0.  0.  0.  6.  7.  8.  9. 10. 11. 12.]
 [ 0.  0.  0.  0.  8.  9. 10. 11. 12. 13.]
 [ 0.  0.  0.  0.  0. 10. 11. 12. 13. 14.]
 [ 0.  0.  0.  0.  0.  0. 12. 13. 14. 15.]
 [ 0.  0.  0.  0.  0.  0.  0. 14. 15. 16.]
 [ 0.  0.  0.  0.  0.  0.  0.  0. 16. 17.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0. 18.]]
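The same upper-triangle table can be built without explicit loops; a sketch using NumPy broadcasting (np.add.outer forms every x+y pair, np.triu keeps the upper triangle including the diagonal):

```python
import numpy as np

a = np.arange(10)
# every pairwise sum x + y, then zero out the lower triangle (y < x)
r = np.triu(np.add.outer(a, a)).astype(float)

print(r[0, :3])   # [0. 1. 2.]
print(r[9, 9])    # 18.0
```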

2. tools (2022 Python Developers Survey)

2.1. web frameworks

  • Bottle
  • CherryPy
  • Django
  • Falcon
  • FastAPI
  • Flask
  • Hug
  • Pyramid
  • Tornado
  • web2py

2.2. additional libraries

  • aiohttp
  • Asyncio
  • httpx
  • Pillow
  • Pygame
  • PyGTK
  • PyQT
  • Requests
  • Six
  • Tkinter
  • Twisted
  • Kivy
  • wxPython
  • Scrapy

2.3. machine learning frameworks

  • Gensim
  • MXNet
  • NLTK
  • Theano

2.4. cloud platforms

  • AWS
  • Rackspace
  • Linode
  • OpenShift
  • PythonAnywhere
  • Heroku
  • Microsoft Azure
  • DigitalOcean
  • Google Cloud Platform
  • OpenStack

2.5. ORMs used together with Python

  • No database development
  • Tortoise ORM
  • Dejavu
  • Peewee
  • SQLAlchemy
  • Django ORM
  • PonyORM
  • Raw SQL
  • SQLObject

2.6. Big Data tools

  • None
  • Apache Samza
  • Apache Kafka
  • Dask
  • Apache Beam
  • Apache Hive
  • Apache Hadoop/MapReduce
  • Apache Spark
  • Apache Tez
  • Apache Flink
  • ClickHouse

2.7. Continuous Integration (CI) systems

  • CruiseControl
  • Gitlab CI
  • Travis CI
  • TeamCity
  • Bitbucket Pipelines
  • AppVeyor
  • GitHub Actions
  • Jenkins / Hudson
  • CircleCI
  • Bamboo

2.8. configuration management tools

  • None
  • Chef
  • Puppet
  • Custom solution
  • Ansible
  • Salt

2.9. documentation tools

  • I don’t use any documentation tools
  • Sphinx
  • MKDocs
  • Doxygen

2.10. IDE features

  • use Version Control Systems
  • use Issue Trackers
  • use code coverage
  • use code linting (programs that analyze code for potential errors)
  • use Continuous Integration tools
  • use optional type hinting
  • use NoSQL databases
  • use autocompletion in your editor
  • run / debug or edit code on remote machines (remote hosts, VMs, etc.)
  • use SQL databases
  • use a Python profiler
  • use Python virtual environments for your projects
  • use a debugger
  • write tests for your code
  • refactor your code

(Survey answer scale for each item: Often / From time to time / Never or almost never.)

2.11. tools to isolate Python environments between projects

  • virtualenv
  • venv
  • virtualenvwrapper
  • hatch
  • Poetry
  • pipenv
  • Conda

2.12. Python packaging tools used directly

  • pip
  • Conda
  • pipenv
  • Poetry
  • venv (standard library)
  • virtualenv
  • flit
  • tox
  • PDM
  • twine
  • Containers (eg: via Docker)
  • Virtual machines
  • Workplace specific proprietary solution

2.13. application dependency management

  • None
  • pipenv
  • poetry
  • pip-tools

2.14. automated services to update application dependency versions

  • None
  • Dependabot
  • PyUp
  • Custom tools, e.g. a cron job or scheduled CI task
  • No, my application dependencies are updated manually

2.15. package installation tools

  • None
  • pip
  • easy_install
  • Conda
  • Poetry
  • pip-sync
  • pipx

2.16. tools to build Python applications

  • None / I'm not sure
  • Setuptools
  • build
  • Wheel
  • Enscons
  • pex
  • Flit
  • Poetry
  • conda-build
  • maturin
  • PDM-PEP517

2.17. job roles

  • Architect
  • QA engineer
  • Business analyst
  • DBA
  • CIO / CEO / CTO
  • Technical support
  • Technical writer
  • Team lead
  • Systems analyst
  • Data analyst
  • Product manager
  • Developer / Programmer

3. install

pip3 install --upgrade pip --user

3.1. change Python version Ubuntu & Debian

update-alternatives --install /usr/bin/python python /usr/bin/python3.8 1
echo 1 | update-alternatives --config python

4. Python theory

4.1. Python [ˈpʌɪθ(ə)n]

  • interpreted
  • code readability
  • indentation instead of curly braces
  • designed to be highly extensible
  • garbage collector
  • functions are first class citizens
  • multiple inheritance
  • all arguments are passed by object reference (call by sharing)
  • nothing in Python makes it possible to enforce data hiding
  • all classes inherit from object

Multi-paradigm:

  • imperative
  • procedural
  • object-oriented
  • functional (in the Lisp tradition) - (itertools and functools) - borrowed from Haskell and Standard ML
  • reflective
  • aspect-oriented programming via metaprogramming and metaobjects (magic methods)
  • dynamic name resolution (late binding)

Typing discipline:

  • Duck
  • dynamic
  • gradual (since 3.5) - a name may be annotated with a type (static) or left unannotated (dynamic)
  • strong

Python and CPython are managed by the non-profit Python Software Foundation.

The Python Standard Library 3.6

  • string processing (regular expressions, Unicode, calculating differences between files)
  • Internet protocols (HTTP, FTP, SMTP, XML-RPC, POP, IMAP, CGI programming)
  • software engineering (unit testing, logging, profiling, parsing Python code)
  • operating system interfaces (system calls, filesystems, TCP/IP sockets)

4.2. philosophy

document Zen of Python (PEP 20)

  • Beautiful is better than ugly
  • Explicit is better than implicit
  • Simple is better than complex
  • Complex is better than complicated
  • Readability counts
  • Errors should never pass silently. Unless explicitly silenced.
  • There should be one– and preferably only one –obvious way to do it.
  • If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea.
  • Namespaces are one honking great idea – let's do more of those!

Other

  • "there should be one—and preferably only one—obvious way to do it"
  • goal - keeping it fun to use ( spam and eggs instead of the standard foo and bar)
  • pythonic - related to style (code is pythonic )
  • Pythonists, Pythonistas, and Pythoneers - nicknames for Python users

https://peps.python.org/pep-0020/#id3
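The Zen ships with the interpreter as an Easter-egg module; importing it prints the text:

```python
import codecs
import this  # importing the module prints the Zen of Python to stdout

# the text is also stored in this.s, ROT13-encoded
zen = codecs.decode(this.s, "rot13")
print("Beautiful is better than ugly" in zen)  # True
```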

4.3. History

Most new Python releases bring performance improvements over the previous version.

  • 1989
  • 2000 - Python 2.0 - cycle-detecting garbage collector and support for Unicode
  • 2008 - Python 3.0 - not completely backward-compatible - include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3.
  • 2009 Python 3.1 - ordered dictionaries (collections.OrderedDict)
  • 2015 Python 3.5 - type hints (typing module, PEP 484), async/await syntax
  • 2016 Python 3.6 asyncio, Formatted string literals (f-strings), Syntax for variable annotations.
    • PEP523 API to make frame evaluation pluggable at the C level.

3.7

  • built-in breakpoint() function that calls pdb; before it was: import pdb; pdb.set_trace()
  • @dataclass (from dataclasses import dataclass) - class annotation sugar; basic functionality comes already implemented: instantiate, print, and compare data class instances
  • contextvars module - mechanism for managing context variables, similar to thread-local storage (TLS) but async-aware (PEP 567)

3.8

  • Positional-Only Parameter: pow(x, y, z=None, /)
  • Assignment Expressions: if (match := pattern.search(data)) is not None: - This feature allows developers to assign values to variables within an expression.
  • f"{a=}", f"Square has area of {(area := length**2)} perimeter of {(perimeter := length*4)}"
  • new SyntaxWarnings: when to choose is over ==, or when a comma is missing in a list
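A small, self-contained sketch of the 3.8 additions above (assignment expressions and f-string self-documenting form):

```python
import re

pattern = re.compile(r"\d+")
data = "order 42 shipped"

# assignment expression: bind and test in a single expression
if (match := pattern.search(data)) is not None:
    print(match.group())     # 42

length = 4
print(f"{length=}")                     # length=4
print(f"area {(area := length ** 2)}")  # area 16
```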

3.9

  • Merge (|) and update (|=) operators added to dict, complementing dict.update() and {**d1, **d2}.
  • Added str.removeprefix(prefix) and str.removesuffix(suffix) to easily remove unneeded sections of a string.
  • More Flexible Decorators: Traditionally, a decorator has had to be a named, callable object, usually a function or a class. PEP 614 allows decorators to be any callable expression.
    • before: decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
    • after: decorator: '@' namedexpr_test NEWLINE
  • type hints: list[int] no longer requires import typing (PEP 585)
  • Annotated[int, ctype("char")] - integer that should be considered as a char type in C.
  • Better time zones handling.
  • The new parser based on PEG was introduced, making it easier to add new syntax to the language.

3.10

  • Structural pattern matching (PEP 634) was added, providing a way to match against and destructure data structures.
    • match command.split(): case [action, obj]: # interpret action, obj
  • Parenthesized context managers: with statements become more readable because multiple context managers can be grouped in parentheses with less boilerplate: with (open("test_file1.txt", "w") as test, open("test_file2.txt", "w") as test2): ...
  • Improved error messages and error recovery were added to the parser, making it easier to debug syntax errors.

3.11

  • Major performance improvements from the "Faster CPython" project (typically 10-60% faster than 3.10).
  • Fine-grained error locations in tracebacks, making it easier to understand and debug runtime errors.
  • Exception groups and the new except* syntax.
  • tomllib added to the standard library for reading TOML files.
  • Many legacy standard-library modules were deprecated for removal (PEP 594).

3.12

  • distutils removed
  • support for the Linux perf profiler; new monitoring API for profilers (sys.monitoring)
  • buffer protocol - classes can now implement it in Python, giving access to a raw region of memory
  • type hints:
    • TypedDict as the source of types for typing **kwargs (PEP 692)
    • new generic syntax func[T] - no need to import TypeVar (PEP 695)
    • @override decorator to flag methods that override methods in a parent class
  • concurrency preparation:
    • immortal objects - enable other optimizations (like avoiding copy-on-write)
    • subinterpreters - the ability to have multiple interpreter instances, each with its own GIL; no end-user interface to subinterpreters yet
    • asyncio is larger and faster
  • sqlite3 module: a command-line interface has been added (python -m sqlite3)
  • unittest: new --durations command line option, showing the N slowest test cases

4.3.1. 3.0

  • Old feature removal: old-style classes, string exceptions, and implicit relative imports are no longer supported.
  • catching an exception into a variable now requires the as keyword: except Exception as var
  • with is now built in and no longer needs to be imported from __future__
  • range: xrange() from Python 2 has been renamed to range(); the original list-returning range() is no longer available
  • print became a function: print("x")
  • input() replaces Python 2's raw_input(); the old evaluating input() is gone
  • all text content such as strings are Unicode by default
  • / now performs true (float) division; in Python 2 it was integer division for ints. The // floor-division operator was added.
  • Python 2.7 code cannot always be automatically translated to Python 3.

4.4. Implementations

CPython, the reference implementation of Python

  • interpreter and a compiler as it compiles Python code into bytecode before interpreting it
  • (GIL) problem - only one thread may be processing Python bytecode at any one time
    • One thread may be waiting for a client to reply, and another may be waiting for a database query to execute, while the third thread is actually processing Python code.
    • Concurrency can only be achieved with separate CPython interpreter processes managed by a multitasking operating system

implementations that are known to be compatible with a given version of the language are IronPython, Jython and PyPy.

  • IronPython - written in C#, uses a JIT, targets the .NET Framework and Mono; code written against .NET libraries is known not to work under CPython
  • PyPy - just-in-time compiler. written completely in Python.
  • Jython - Python in Java for the Java platform

CPython based:

  • Cython - translates a Python script into C and makes direct C-level API calls into the Python interpreter

Stackless Python - a significant fork of CPython that implements microthreads; it does not use the C memory stack, thus allowing massively concurrent programs.

Numba - NumPy-aware optimizing runtime compiler for Python

MicroPython - Python for microcontrollers (runs on the pyboard and the BBC Microbit)

Jython and IronPython - do not have a GIL, so multithreaded execution of a CPU-bound Python application will work. Unfortunately, these platforms are always playing catch-up with new language and library features.

Pythran, a static Python-to-C++ extension compiler for a subset of the language, mostly targeted at numerical computation. Pythran can be (and is probably best) used as an additional backend for NumPy code in Cython.

mypyc, a static Python-to-C extension compiler, based on the mypy static Python analyser. Like Cython's pure Python mode, mypyc can make use of PEP-484 type annotations to optimise code for static types. Cons: no support for low-level optimisations and typing, opinionated Python type interpretation, reduced Python compatibility and introspection after compilation

Nuitka, a static Python-to-C extension compiler.

  • Pros: highly language compliant, reasonable performance gains, support for static application linking (similar to cython_freeze but with the ability to bundle library dependencies into a self-contained executable)
  • Cons: no support for low-level optimisations and typing

Brython is an implementation of Python 3 for client-side web programming (in JavaScript). It provides a subset of Python 3 standard library combined with access to DOM objects. It is packaged in Gentoo as dev-python/brython.

4.5. Bytecode:

  • Java is compiled into bytecode and then executed by the JVM.
  • C is compiled into object code, which the linker then turns into an executable file
  • Python is first compiled to bytecode and then executed via ceval.c - the interpreter directly executes the translated instruction set.

Bytecode is a set of instructions for a virtual machine, the Python Virtual Machine (PVM).

The PVM is an interpreter that runs the bytecode.

The bytecode is platform-independent, but the PVM is specific to the target machine. Bytecode is cached in .pyc files.

The bytecode files (.pyc) are stored in a folder named __pycache__. This folder is created automatically when you import a file that you created.

To create them manually: python -m compileall file_1.py … file_n.py
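The bytecode of a function can be inspected with the standard dis module:

```python
import dis

def add(a, b):
    return a + b

dis.dis(add)  # prints instructions, e.g. LOAD_FAST and a binary-add opcode

# opcode names are also available programmatically
opnames = [instr.opname for instr in dis.Bytecode(add)]
print("RETURN_VALUE" in opnames)  # True
```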

4.6. terms

binding the name to the object - x = 2 - (generic) name x receives a reference to a separate, dynamically allocated object of numeric (int) type of value 2

4.7. Indentation and blank lines

The exact amount of indentation does not matter, only that it is consistent within a block.

The indented statements form the "suite"; the line ending with a colon is the header line:

if True:
    print("True")
else:
    print("False")

Blank Lines - ignored

semicolon ( ; ) allows multiple statements on one line

Internally the tokenizer emits:

  • INDENT - token marking the start of a new block
  • DEDENT - token marking the end of a block

4.8. mathematic

  • arbitrary-precision arithmetic - the size of integers is limited only by available memory
  • Extensive mathematics library, and the third-party library NumPy that further extends the native capabilities
  • a < b < c - support
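Both points sketched:

```python
# arbitrary-precision integers: size limited only by available memory
big = 2 ** 200
print(len(str(big)))   # 61 decimal digits

# chained comparisons
a, b, c = 1, 2, 3
print(a < b < c)       # True (means: a < b and b < c)
print(3 < 2 < 4)       # False
```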

4.9. WSGI (Web Server Gateway Interface)(whiskey)

  • calling convention for web servers to forward requests to web applications or frameworks written in the Python programming language.
  • like Java's "servlet" API.
  • WSGI middleware components, which implement both sides of the API, typically in Python code.

5. scripting

5.1. top-level script environment

__name__ - equal to '__main__' when run as a script, with "python -m", or from an interactive prompt. '__main__' is the name of the scope in which top-level code executes.

if __name__ == "__main__": - the guarded code does not execute when the module is imported

__file__ - full path to module file
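The usual guard idiom, as a minimal sketch:

```python
# demo.py - a minimal sketch of the __main__ guard
def main() -> None:
    print("running as a script")

if __name__ == "__main__":   # False when demo.py is imported as a module
    main()
```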

5.2. command line arguments parsing

import sys

print('Number of arguments:', len(sys.argv), 'arguments.')
print('Argument List:', str(sys.argv))

getopt or argparse are better for non-trivial parsing

5.3. python executable

  • -c cmd : program passed in as string (terminates option list)
  • -m mod : run library module as a script (terminates option list)
  • -O : remove assert and debug-dependent statements; add .opt-1 before .pyc extension; also PYTHONOPTIMIZE=x
  • -OO : do -O changes and also discard docstrings; add .opt-2 before .pyc extension
  • -s : don't add user site directory to sys.path; also PYTHONNOUSERSITE. Disable home/u2.local/lib/python3.8/site-packages
  • -S : don't imply 'import site' on initialization
    • /usr/lib/python38.zip
    • /usr/lib/python3.8
    • /usr/lib/python3.8/lib-dynload

5.4. current dir

script_dir = os.path.dirname(os.path.abspath(__file__))

5.5. unix logger

import logging
import sys

def init_logger(level, logfile_path: str = None):
    """
    stderr  WARNING ERROR and CRITICAL
    stdout < WARNING

    :param logfile_path:
    :param level: level for stdout
    :return:
    """

    formatter = logging.Formatter('mkbsftp [%(asctime)s] %(levelname)-6s %(message)s')
    logger = logging.getLogger(__name__)
    logger.setLevel(level)  # debug - lowest
    # log file
    if logfile_path is not None:
        h0 = logging.FileHandler(logfile_path)
        h0.setLevel(level)
        h0.setFormatter(formatter)
        logger.addHandler(h0)
    # stdout -- python3 script.py 2>/dev/null | xargs
    h1 = logging.StreamHandler(sys.stdout)
    h1.setLevel(level)  # level may be changed
    h1.addFilter(lambda record: record.levelno < logging.WARNING)
    h1.setFormatter(formatter)
    # stderr -- python3 script.py 2>&1 >/dev/null | xargs
    h2 = logging.StreamHandler(sys.stderr)
    h2.setLevel(logging.WARNING)  # fixed level
    h2.setFormatter(formatter)

    logger.addHandler(h1)
    logger.addHandler(h2)
    return logger

5.6. How does python find packages?

sys.path - Initialized from the environment variable PYTHONPATH, plus an installation-dependent default.

find a module (note: imp is deprecated since 3.4 and removed in 3.12):

  • import imp
  • imp.find_module('numpy')
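The modern equivalent of the imp lookup above uses importlib:

```python
import importlib.util

spec = importlib.util.find_spec("json")   # None if the module cannot be found
print(spec is not None)  # True
print(spec.origin)       # filesystem path of the module's source
```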

5.7. dist-packages and site-packages?

  • dist-packages is a Debian-specific convention that is also present in its derivatives, like Ubuntu. Modules are installed to dist-packages when they come from the Debian package manager. This is to reduce conflict between the system Python, and any from-source Python build you might install manually.

https://wiki.debian.org/Python

5.8. file size and modification date

os.stat(pf).st_size
os.stat(pf).st_mtime

5.9. environment

os.environ - dictionary

try … except KeyError: - no variable in dictionary

os.environ.get('FLASK_SOME_STAFF') - None if no key

os.environ['BBB']                       # KeyError if BBB is not set in the environment
DEBUG = os.environ.get('DEBUG', False)  # DEBUG is the value, or False if unset

5.10. -m mod - run library module as a script

https://peps.python.org/pep-0338/

  • __name__ is always '__main__'

5.10.1. e.g. mymodule/__main__.py:

import argparse

# assumes an "app" object (e.g. a Flask application) is defined or imported elsewhere

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--port", action="store", default="8080")
    parser.add_argument("--host", action="store", default="0.0.0.0")
    args = parser.parse_args()
    port = int(args.port)
    host = str(args.host)
    app.run(host=host, port=port, debug=False)
    return 0

if __name__ == "__main__":
    main()

6. Data model

Standard data types:

  • Numbers
  • String
  • List :list - []
  • Tuple :tuple - ()
  • Dictionary :dict - {}
  • Callable :callable
  • :object

6.1. special types

https://docs.python.org/3/reference/datamodel.html

  • None - a single value
  • NotImplemented - Numeric methods and rich comparison methods should return this value if they do not implement the operation for the operands provided.
  • Ellipsis - accessed through the literal ... or the built-in name Ellipsis.
  • numbers.Number
  • Sequences - represent finite ordered sets indexed by non-negative numbers (len() for sequence)
    • mutable: lists, Byte Arrays
    • immutable: str, tuple, bytes
  • Set types -
    • Sets - mutable
    • Frozen sets - frozenset()
  • Mappings - finite sets of objects indexed by arbitrary keys (a[k]); support del a[k] and len()
  • Callable
    • Instance methods
    • Generator functions - function or method which uses the yield statement
      • when called, always returns an iterator object
    • Coroutine functions - async def - when called, returns a coroutine object
    • Asynchronous generator functions
    • Built-in functions
    • Built-in methods
    • Classes - factories for new instances of themselves
    • Class Instances - can be made callable by defining a __call__() method in their class.
  • Modules - __name__ (the module's name), __doc__, __file__ (the pathname of the file from which the module was loaded), __annotations__; __dict__ is the module's namespace as a dictionary object.
  • Custom classes -
  • Class instances

6.2. theory

  • everything is an object, even classes. (Von Neumann’s model of a “stored program computer”)
  • object has identity, a type and a value
  • identity - address in memory (in CPython); never changes once the object is created
    • id(object) = identity
    • x is y - compare identities x is not y
  • type or class
    • type()
  • the value of some objects can change - mutable vs immutable; an immutable container may still refer to mutable objects
    • numbers, strings and tuples are immutable
    • dictionaries and lists are mutable

6.3. Types build-in

  • None - name for a single object - signifies the absence of a value; truth value: false.
  • NotImplemented - name for a single object - numeric methods and rich comparison methods should return this value if they do not implement the operation for the operands provided; truth value: true.
  • Ellipsis - a single object, accessed through ... or the name Ellipsis; truth value: true
  • numbers.Number - immutable
    • numbers.Integral
      • Integers (int) - unlimited range
      • Booleans (bool) - 0 and 1; printed as "False" and "True" in most contexts
    • numbers.Real (float) - the underlying machine architecture determines the accepted range and handling of overflow
    • numbers.Complex (complex) - z.real and z.imag - pair of machine-level double precision floating point numbers
  • Sequences - finite ordered sets len() - index a[i]: 0 to n-1; min(s), max(s) ; s * n - n copies of s ; s + t concatenation; x in s - True if an item of s is equal to x
    • Immutable sequences (common operation: s.index(obj))
      • str - immutable sequences of Unicode code points; s[0] is a string of length 1 (one code point). ord(s) - character to code point (0 to 0x10FFFF); chr(i) - int to character; str.encode() -> bytes; bytes.decode() -> str
      • Tuple - immutable: (), (1,), (1, '23') - items of any type
      • range()
      • Bytes - items are 8-bit bytes (0-255) - literal b'ab'; bytes() creates one
    • Mutable (unhashable) - del list[0] removes the first element
      • List - mutable: [1, '3'] - items of any type
      • Byte Array - bytearray - bytearray()
      • memoryview
  • Set types - unordered finite sets of unique, immutable items - compared by ==; have len()
    • set - mutable - items must be immutable (hashable): x in s, for x in s - {'h', 'o', 'l', 'e'}
    • frozenset - immutable and hashable - so it can be used as an element of another set
  • Mappings - finite sets of objects indexed by keys; support del a[k] and len()
    • Dictionary - mutable - keys are unique within a dictionary - indexed by nearly arbitrary values - keys must be immutable (hashable) - {2 : 'Zara', 'Age' : 7, 'Class' : 'First'}; dict[3] = "my" # add new entry
  • Callable types - objects to which the call operation can be applied - code that can be called
    • User-defined functions
    • Instance methods - read-only attributes: __self__, __func__
    • Generator functions - functions that return a generator iterator; they look like normal functions except that they contain yield expressions
    • Coroutine functions - async def - when called, return a coroutine object
    • Asynchronous generator functions
    • Built-in functions - len() and math.sin() (math is a standard built-in module)
    • Built-in methods alist.append()
    • Classes - act as factories for new instances of themselves. arguments of the call are passed to __new__()
    • Class Instances - may be callable by defining a __call__() method
  • Modules
  • Custom classes

6.4. Truth Value Testing

false:

  • None and False.
  • zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
  • empty sequences and collections: '', (), [], {}, set(), range(0)
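All of the values above evaluate to false in a boolean context:

```python
from decimal import Decimal
from fractions import Fraction

falsy = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
         "", (), [], {}, set(), range(0)]
print(any(falsy))             # False - every item is falsy
print(bool([0]), bool(" "))   # True True - non-empty containers/strings are truthy
```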

6.5. Shallow and deep copy operations

  • import copy
  • copy.copy(x) Return a shallow copy of x.
  • copy.deepcopy(x[, memo]) Return a deep copy of x.
  • a class can define its own copy behavior via __copy__() and __deepcopy__()
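The difference in one sketch: a shallow copy shares the inner objects, a deep copy does not:

```python
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)       # new outer list, same inner lists
deep = copy.deepcopy(a)      # recursively copied

a[0].append(99)
print(shallow[0])  # [1, 2, 99] - change is visible through the shallow copy
print(deep[0])     # [1, 2]     - deep copy is unaffected
```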

7. type hints (typed variables)

variable_name: type

7.1. typing.Annotated and PEP-593

data models, validation, serialization, UI

v: Annotated[T, *x]

  • v: a “name” (variable, function parameter, . . . )
  • T: a valid type
  • x: at least one metadata (or annotation), passed in a variadic way. The metadata can be used for either static analysis or at runtime.

Ignorable: When a tool or a library does not support annotations or encounters an unknown annotation it should just ignore it and treat annotated type as the underlying type.

stored in obj.__annotations__

7.1.1. from typing import get_type_hints

from dataclasses import dataclass
from typing import Annotated, get_type_hints

@dataclass
class Point:
  x: int
  y: Annotated[int, Label("ordinate")]  # Label: a user-defined metadata class

get_type_hints(Point, include_extras=True)  # include_extras keeps the Annotated metadata
# {'x': <class 'int'>, 'y': typing.Annotated[int, Label('ordinate')]}

7.1.2. Use case: A calendar Event model, using pydantic https://github.com/pydantic/pydantic

from datetime import datetime
from pydantic import BaseModel
class Event(BaseModel):
    summary: str
    description: str | None = None
    start_at: datetime | None = None
    end_at: datetime | None = None

# -- Validation on datetime fields (using Pydantic)


from typing import Annotated
from pydantic import AfterValidator

class Event(BaseModel):
    summary: str
    description: str | None = None
    start_at: Annotated[datetime | None, AfterValidator(tz_aware)] = None
    end_at: Annotated[datetime | None, AfterValidator(tz_aware)] = None

def tz_aware(d: datetime) -> datetime:
    if d.tzinfo is None or d.tzinfo.utcoffset(d) is None:
        raise ValueError ("expecting a TZ-aware datetime")
    return d

# -- iCalendar serialization support

TZDatetime = Annotated[datetime, AfterValidator(tz_aware)]

from . import ical

class Event(BaseModel):
    summary: Annotated[str, ical.Serializer(label="summary")]
    description: Annotated[str | None, ical.Serializer(label="description")] = None
    start_at: Annotated[TZDatetime | None, ical.Serializer(label="dtstart")] = None
    end_at: Annotated[TZDatetime | None, ical.Serializer(label="dtend")] = None

# module: ical
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any

@dataclass
class Serializer:
    label: str

    def serialize(self, value: Any) -> str:
        if isinstance(value, datetime):
            value = value.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        return f"{self.label.upper()}:{value}"


def serialize_event(obj: Event) -> str:
    lines = []
    # get_annotations: helper defined elsewhere; yields
    # (field name, matching Serializer annotation, field type) triples
    for name, a, _ in get_annotations(obj, Serializer):
        if (value := getattr(obj, name, None)) is not None:
            lines.append(a.serialize(value))
    return "\n".join(["BEGIN:VEVENT"] + lines + ["END:VEVENT"])
# console rendering

# >>> evt = Event(
# ... summary="FOSDEM",
# ... start_at=datetime(2024, 2, 3, 9, 00, 0, tzinfo=ZoneInfo("Europe/Brussels")),
# ... end_at=datetime(2024, 2, 4, 17, 00, 0, tzinfo=ZoneInfo("Europe/Brussels")),
# ... )
# >>> print(ical.serialize_event(evt))
# BEGIN:VEVENT
# SUMMARY:FOSDEM
# DTSTART:20240203T080000Z
# DTEND:20240204T160000Z
# END:VEVENT

7.2. function annotation

def function_name(parameter1: type) -> return_type:
from typing import Dict

def get_first_name(full_name: str) -> str:
    return full_name.split(" ")[0]

fallback_name: Dict[str, str] = {
    "first_name": "UserFirstName",
    "last_name": "UserLastName"
}

raw_name: str = input("Please enter your name: ")
first_name: str = get_first_name(raw_name)

# If the user didn't type anything in, use the fallback name
if not first_name:
    first_name = fallback_name["first_name"]

print(f"Hi, {first_name}!")

8. Strings

Quotation [kwəʊˈteɪʃn] for strings: single ('), double (") and triple (''' or """) quotes denote string literals

8.1. основы

S = 'str'; S = "str"; S = '''str''';

para_str = """this is a long string that is made up of
several lines and non-printable characters such as
TAB ( \t ) and they will show up that way when displayed.
NEWLINEs within the string, whether explicitly given like
this within the brackets [ \n ], or just a NEWLINE within
the variable assignment will also show up."""

8.1.1. multiline

  1. triple quotes: s = """My Name is Pankajin Developers community."""
  2. implicit concatenation: s = ('asd' 'asd')  # 'asdasd'
  3. backslash continuation:
     s = "My Name is Pankaj. " \
         "website in Developers community."
  4. join: s = ' '.join(("My Name is Pankaj. I am the owner of", "JournalDev.com and"))

8.2. A formatted string literal or f-string

equivalent to format()

  • '!s' calls str() on the expression
  • '!r' calls repr() on the expression
  • '!a' calls ascii() on the expression.
>>> name = "Fred"
>>> f"He said his name is {name!r}." # repr() is equivalent to !r
"He said his name is 'Fred'."

Digits after the decimal point (precision):

>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal("12.34567")
>>> f"result: {value:{width}.{precision}}"  # nested fields
'result:      12.35'

Date formatting:

>>> today = datetime(year=2017, month=1, day=27)
>>> f"{today:%B %d, %Y}"  # using date format specifier
'January 27, 2017'
>>> number = 1024
>>> f"{number:#0x}"  # using integer format specifier
'0x400'

format:

>>> '{:,}'.format(1234567890)
'1,234,567,890'
>>> 'Correct answers: {:.2%}'.format(19/22)
'Correct answers: 86.36%'

8.3. String Formatting Operator

  • print ("My name is %s and weight is %d kg!" % ('Zara', 21))

8.4. string literal prefixes

str or strings - immutable sequences of Unicode code points.

r'' R'' raw strings
Raw strings do not treat the backslash as a special character at all: print(r'C:\nowhere') prints C:\nowhere
b'' B'' bytes (NOT str)
literals may only contain ASCII characters

8.5. raw strings, Unicode, formatted

  • r'string' - treat backslashes as literal characters
  • f'string' or F'string' - f"He said his name is {name!r}." - formatted
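A quick check of both prefixes (the path and name values are arbitrary):

```python
path = r'C:\new\folder'   # raw: backslash and 'n' stay two separate characters
assert '\\n' in path and '\n' not in path

name = 'Fred'
assert f"He said his name is {name!r}." == "He said his name is 'Fred'."
```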

8.6. Efficient String Concatenation

  • concatenation at runtime
# Fastest:
s = ''.join([str(num) for num in range(loop_count)])

def g():
    sb = []
    for i in range(30):
        sb.append("abcdefg"[i%7])

    return ''.join(sb)

print(g())   # abcdefgabcdefgabcdefgabcdefgab

8.7. byte string

b''

  • byte string to unicode str: bytes.decode()
  • unicode str to byte string: str.encode('utf-8')

Your string is already encoded with some encoding. Before encoding it to ascii, you must decode it first. Python implicitly tries to decode it (that's why you get a UnicodeDecodeError, not a UnicodeEncodeError).
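Round-trip sketch of encode/decode (the sample string is arbitrary):

```python
text = "Привет"                 # str: Unicode code points
raw = text.encode("utf-8")      # str -> bytes
assert isinstance(raw, bytes)
assert len(raw) == 12           # each Cyrillic letter takes 2 bytes in UTF-8
assert raw.decode("utf-8") == text  # bytes -> str
```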

9. Classes

  • Class object - support two kinds of operations: attribute references and instantiation.
  • Instance object - attribute references - data and methods

Data attributes correspond to “instance variables” in Smalltalk and to “data members” in C++. Class (static) variables are shared by all instances.

  • instance variables may be reassigned
  • instance methods may be reassigned to any method or function; it is just an alias

object - parent for all classes

  • __class__ - class of instance
  • __init__
  • __new__
  • __init_subclass__
  • __delattr__, __dir__, __doc__, __eq__, __format__, __ge__, __getattribute__, __gt__, __hash__, __le__, __lt__, __ne__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

9.1. basic

class MyClass:
    a = None
c = MyClass()
c.a = 3 # instance attribute shadows the class attribute

class MyClass:
    """MyClass.i and MyClass.f are valid attribute references"""
    i = 12345 # class value
    def __init__(self, a):
        self.i = a # create new object value
    def f(self):
        print("f")

x = MyClass(2) # instance
x.a = 3 # data attribute

print(x.a)
print(x.i)
print(MyClass.i)
print(x.f)
print(MyClass.f)
# MyClass.f and x.f — it is a method object, not a function object.
3
2
12345
<bound method MyClass.f of <__main__.MyClass object at 0x7f37165d4790>>
<function MyClass.f at 0x7f37165c5440>
class Dog:
    kind = 'canine'         # class variable shared by all instances
    tricks = []             # class variable: shared mutable state, a common pitfall!

    def __init__(self, name):
        self.name = name    # instance variable unique to each instance

#-------------- class method
class C:
    @classmethod
    def f(cls, arg1, arg2): ...
# May be called on the class C.f() or on an instance C().f(). For a derived
# class, the derived class object is passed as the implied first argument.
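A sketch showing that the derived class is what gets passed as cls (class names are illustrative):

```python
class C:
    count = 0
    @classmethod
    def make(cls):
        cls.count += 1   # rebinds count on cls, so D gets its own attribute
        return cls()     # instantiates the class it was called on

class D(C):
    pass

d = D.make()             # cls is D here, not C
assert isinstance(d, D)
assert D.count == 1 and C.count == 0
```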

9.2. Special Attributes

  • instance.__class__ - The class to which a class instance belongs.
  • class.__mro__ or mro() - This attribute is a tuple of classes that are considered when looking for base classes during method resolution.
  • class.__subclasses__() - Each class keeps a list of weak references to its immediate subclasses.

Class attributes:

  • __name__ The class name.

  • __module__ The name of the module in which the class was defined.
  • __dict__ The dictionary containing the class’s namespace.
  • __bases__ A tuple containing the base classes, in the order of their occurrence in the base class list.
  • __doc__ The class’s documentation string, or None if undefined.
  • __annotations__ A dictionary containing variable annotations collected during class body execution. For best practices on working with annotations, please see Annotations Best Practices.
  • __new__(cls,…) - static method - special-cased so you need not declare it as such. The return value of __new__() should be the new object instance (usually an instance of cls).
    • typically: super().__new__(cls[, …]) with appropriate arguments and then modifying the newly-created instance as necessary before returning it.
    • then the new instance’s __init__() method will be invoked
  • __call__(self,…)

Class instances

  • super() - Return a proxy object that delegates method calls to a parent or sibling class of type

9.3. inheritance

9.3.1. Constructor

  • classes whose base class is object should not call super().__init__()
  • classes inherit from object by default
  • you should never write a class that inherits from object and doesn't have an init method

designed for cooperative inheritance:

class CoopFoo:
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs) # forwards all unused arguments

super(type, object-or-type)

  • type - get parent or sibling of type
  • object-or-type.mro() determines the method resolution order to be searched

super(self.__class__, self) == super() # caution: self.__class__ breaks in subclasses (infinite recursion)

9.3.2. Subclassing:

  • direct - a - b
  • indirect - a - b - c
  • virtual - abstract base class
class SubClassName (ParentClass1[, ParentClass2, ...]):
   'Optional class documentation string'
   class_suite

9.3.3. built-in functions that work with inheritance:

  • isinstance(obj, int) - True only if obj.__class__ is int or some class derived from int
  • issubclass(bool, int) - True since bool is a subclass of int
  • type(ins) == a.__class__
  • type(ins) is Class_name
  • isinstance(ins, Class_name)
  • issubclass(ins.__class__, Class_name)
  • class.mro() - get class.__mro__ attribute

9.3.4. example

class aa():
    def __init__(self, aaa, vv):
        self.aaa = aaa
        self.vv = vv

    def get(self):
        print(self.aaa + self.vv)

class bb(aa):
    def __init__(self, aaa, *args, **kwargs):
        super().__init__(aaa, *args, **kwargs)
        self.aaa = aaa +'asd'


s = bb('aa', 'vv')
s.get()
>> aaasdvv

9.3.5. Multiple inheritance - left-to-right

  • Method Resolution Order (MRO) (какой метод вызывать из родителей) changes dynamically to support cooperative calls to super() (class.__mro__) (obj.__class__.__mro__)

__spam is textually replaced with _classname__spam - name mangling, applied in the defining class and inherited

9.3.6. Abstract class (ABC - abstract base class)

Notes:

  • Dynamically adding abstract methods to a class, or attempting to modify the abstraction status of a method or class once it is created, are not supported.
from abc import ABCMeta, abstractmethod

class MyABC(metaclass=ABCMeta):
    @abstractmethod
    def foo(self): pass

# or
from abc import ABC, abstractmethod

class MyABC(ABC):
    @abstractmethod
    def foo(self): pass

class B(A):
    def __init__(self, first_name, last_name, salary):
        super().__init__(first_name, last_name) # if A has __init__
        self.salary = salary
    def foo(self):
        return True

9.3.7. Virtual subclasses

A virtual subclass (and its descendants) is registered with an ABC via the register() method, which affects isinstance() and issubclass() checks without real inheritance.

class MyABC(metaclass=ABCMeta):    pass
MyABC.register(tuple)
assert issubclass(tuple, MyABC) # tuple is virtual subclass of MyABC now

9.3.8. calling parent class constructor
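A minimal sketch (class names are made up):

```python
class Base:
    def __init__(self, name):
        self.name = name

class Child(Base):
    def __init__(self, name, age):
        super().__init__(name)  # run the parent constructor first
        self.age = age          # then add own state

c = Child("Bob", 7)
assert (c.name, c.age) == ("Bob", 7)
```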

9.4. Getters and setters

  • no private variables

@property - pythonic way

class Celsius:
    def __init__(self, temperature = 0):
        self.temperature = temperature

    def to_fahrenheit(self):
        return (self.temperature * 1.8) + 32

    def get_temperature(self):
        print("Getting value")
        return self._temperature

    def set_temperature(self, value):
        if value < -273:
            raise ValueError("Temperature below -273 is not possible")
        print("Setting value")
        self._temperature = value

    temperature = property(get_temperature,set_temperature)

>>> c = Celsius()
>>> c.temperature
Getting value
0
>>> c.temperature = 37
Setting value


#----------- OR ------
class Celsius:
    def __init__(self, temperature = 0):
        self.temperature = temperature

    def to_fahrenheit(self):
        return (self.temperature * 1.8) + 32

    @property
    def temperature(self):
        print("Getting value")
        return self._temperature

    @temperature.setter
    def temperature(self, value):
        if value < -273:
            raise ValueError("Temperature below -273 is not possible")
        print("Setting value")
        self._temperature = value

9.5. Polymorphism [pɔlɪˈmɔːfɪzm]

inheritance for shared behavior, not for polymorphism

class Square(object):
    def draw(self, canvas): pass

class Circle(object):
    def draw(self, canvas): pass

shapes = [Square(), Circle()]
for shape in shapes:
    shape.draw('canvas')

9.6. Protocols or emulation

Overriding special (dunder) methods lets a class take part in built-in language constructs.

Protocol          Methods                          Supports syntax
Sequence          __getitem__, __len__ etc.        seq[1:2]
Iterators         __iter__, __next__               for x in coll:
Comparison        __eq__, __gt__ etc.              x == y, x > y
Numeric           __add__, __sub__, __and__ etc.   x+y, x-y, x&y ..
String like       __str__, __repr__                print(x)
Attribute access  __getattr__, __setattr__         obj.attr
Context managers  __enter__, __exit__              with open('a.txt') as f: f.read()

9.7. private and protected

  • public - all
  • Protected: _property
  • Private: __property

9.8. object

object() or object - base for all classes

dir(object())

['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

  • __dict__ − Dictionary containing the class's namespace.
  • __doc__ - docstring
  • __init__ - constructor
  • __str__ - toString() - Return a string version of object
  • __name__ - Class name
  • __module__ - Module name in which the class is defined. This attribute is "__main__" in interactive mode.
  • __bases__ − A possibly empty tuple containing the base classes, in the order of their occurrence in the base class list.
  • __hash__ - hashcode()
  • __repr__ - string printable representation of an object

9.9. Singleton

  • simple (eager)
  • lazy
  • module-level singleton - all modules are singletons by default

9.9.1. example

class Singleton(object):
    def __new__(cls):
        if not hasattr(cls, 'instance'):
            cls.instance = super(Singleton, cls).__new__(cls)
        return cls.instance
# Lazy instantiation in Singleton
class Singleton:
    __instance = None
    def __init__(self):
        if not Singleton.__instance:
            print(" __init__ method called..")
        else:
            print("Instance already created:", self.getInstance())
    @classmethod
    def getInstance(cls):
        if not cls.__instance:
            cls.__instance = Singleton()
        return cls.__instance
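A quick identity check of the __new__-based variant (repeated here so the snippet is self-contained):

```python
class Singleton(object):
    def __new__(cls):
        if not hasattr(cls, 'instance'):
            cls.instance = super(Singleton, cls).__new__(cls)
        return cls.instance

a = Singleton()
b = Singleton()
assert a is b   # both names point to the single shared instance
```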

9.9.2. Monostate (Borg) pattern

all instances share the same state

class Borg:
   __shared_state = {"1": "2"}
   def __init__(self):
      self.__dict__ = self.__shared_state # must come before setting own attributes
      self.x = 1
b = Borg()
b1 = Borg()
b.x = 4
print("Borg Object 'b': ", b) ## b and b1 are distinct objects
print("Borg Object 'b1': ", b1)
print("Object State 'b':", b.__dict__)## b and b1 share same state
print("Object State 'b1':", b1.__dict__)
>> Borg Object 'b':  <__main__.Borg object at 0x10baa5a70>
>> Borg Object 'b1':  <__main__.Borg object at 0x10baa5638>
>> Object State 'b': {'1': '2', 'x': 4}
>> Object State 'b1': {'1': '2', 'x': 4}

9.10. anonymous class

9.10.1. 1

class Bunch(dict):
    __getattr__, __setattr__ = dict.get, dict.__setitem__

dict(x=1,y=2) or {'x':1,'y':2}

b = Bunch(x=1, y=2); b.x # attribute access backed by dict keys

9.11. replace method

class A():
    def cc(self):
        print("cc")

c = A.cc

def ff(self):
    print("ff")
    c(self)

A.cc = ff
a = A()
a.cc()
ff
cc
class A():
    def cc(self):
        print("cc")

a = A()
c = a.cc

def ff(self):
    print("ff")
    c()

A.cc = ff
a = A()
a.cc()
ff
cc

10. modules and packages

  • module - file
  • package - folder - must have __init__.py to be able to import the folder as a module.
  • __main__.py - allow to execute folder: python -m folder

module can define

  • functions
  • classes
  • variables
  • runnable code.

When a module is imported (anyhow) into a script, the code in the top-level portion of a module is executed only once.

Import whole file - обращаться с файлом -

import module1[, module2[, ... moduleN]]
import support   #just a file support.py

support.print_func("Zara")

Import specific thing from file to access without module

from modname import name1[, name2[, ... nameN]]
from modname import *

__name__ - name of this module.

Locating Modules:

  • current dir
  • PYTHONPATH - shell variable - list of directories
  • default path. On UNIX /usr/local/lib/python3

build-in functions

  • dir(math) - list of strings containing the names defined by a module or in current
  • locals() - within a function, it will return all the names that can be accessed locally from that function (dictionary)
  • globals() - return the global symbol table as a dictionary
  • importlib.reload(module) - reexecute the top-level code of module.

To make all of your functions available when you have imported Phone, put in Phone/__init__.py:

from Pots import Pots
from Isdn import Isdn
from G3 import G3

Main

def main(args):pass
if __name__ == '__main__':  #name of module-namespace. '__main__' for - $python a.py
    import sys
    main(sys.argv)
    quit()

10.1. module special attributes (Module level "dunders") [-ʌndə(ɹ)]

  • __name__
  • __doc__
  • __dict__ - module’s namespace as a dictionary object
  • __file__ - is the pathname of the file from which the module was loaded, if it was loaded from a file.
  • __annotations__ - optional - dictionary containing variable annotations collected during module body execution

11. folders/files USECASES

  • list files and directories depth=1: os.listdir() -> list
  • list only files depth=1: os.listdir() AND os.path.isfile()
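A self-contained sketch using a temporary directory so the result is deterministic:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.txt"), "w").close()   # one file
    os.mkdir(os.path.join(d, "sub"))              # one directory
    entries = os.listdir(d)                       # files AND dirs, depth=1
    files = [f for f in entries if os.path.isfile(os.path.join(d, f))]

assert sorted(entries) == ["a.txt", "sub"]
assert files == ["a.txt"]
```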

12. functions

  • python does not support method overloading
  • functions can be declared inside functions
  • functions see the scope where they are defined, not where they are called
  • if a function returns nothing, it returns None
  • a function can return several values: return a, b returns the tuple (a, b), which can be unpacked: a, b = c()

12.1. by value or by reference

by value:

  • immutable:
    • strings
    • integers
    • tuples
    • others…

by reference:

  • mutable:
    • objects
    • lists, sets, dicts

(strictly, Python always passes object references; immutability just makes changes invisible to the caller)
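The difference is visible when a function mutates vs rebinds its argument:

```python
def append_item(lst):
    lst.append(4)        # in-place mutation: the caller sees it

def rebind(s):
    s = s + "x"          # rebinding a local name: the caller does not

nums = [1, 2, 3]
append_item(nums)
assert nums == [1, 2, 3, 4]

name = "ab"
rebind(name)
assert name == "ab"
```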

12.2. Types of function arguments

  • Positional arguments (first, second, third=None, fourth=None) (first, second) - positional, (third, fourth) - keyword arguments
  • Keyword arguments - printinfo( age = 50, name = "miki" ) - order does not matter
  • Default arguments - def printinfo( name, age = 35 ):
  • Variable-length or Arbitrary Argument Lists positional arguments
def printinfo( arg1, *vartuple ):
  for var in vartuple:
     print (var)
printinfo (1, 'asd','d31', 'cv')
  • Variable-length or Arbitrary Argument Lists Keyword arguments
def save_ranking(**kwargs):
  print(kwargs)
save_ranking(first='ming', second='alice', fourth='wilson', third='tom', fifth='roy')
>>> {'first': 'ming', 'second': 'alice', 'fourth': 'wilson', 'third': 'tom', 'fifth': 'roy'}
  • both
def save_ranking(*args, **kwargs): ...
save_ranking('ming', 'alice', 'tom', fourth='wilson', fifth='roy')

12.3. example

def functionname( parameters:type ) -> return_type:
   "function_docstring"
   function_suite
   return [expression]


def readit(file: str, fun: callable) -> list: ...

12.4. arguments, anonymous-lambda, global variables

Anonymous Functions: - one-line version of a function

lambda [arg1 [,arg2,.....argn]]:expression
(lambda x, y: x + y)(1, 2)

global variables can be read from any function, including lambdas; assigning to one requires the global statement

# global Money  # Uncomment to rebind the global Money instead of a local.
  Money = Money + 1 # without global this raises UnboundLocalError

12.5. attributes

User-defined function

  • __doc__
  • __name__
  • __qualname__
  • __module__
  • __defaults__
  • __code__
  • __globals__
  • __dict__
  • __closure__
  • __annotations__
  • __kwdefaults__

    Instance methods: read-only attributes:

  • __self__ - class instance object
  • __func__ - function object
  • __module__ - name of the module the method was defined in

12.6. function decorators

a function that takes one function and returns another function

  • when you need to extend the functionality of functions that you don't want to modify
  • @classmethod

Typically used to catch exceptions in wrapper

  def p_decorate(f):
     def inner(name): # wrapper
         # do something here!
         return f(name) # call the wrapped function
     return inner

  my_get_text = p_decorate(get_text) # wrap it; now
  my_get_text("John") # the wrapper calls the wrapped function

  #syntactic sugar
  @p_decorate
  def get_text(name):
     return "bla " + name

  #-------------
  get_text = div_decorate(p_decorate(strong_decorate(get_text)))
  # Equal to
  @div_decorate
  @p_decorate
  @strong_decorate
  def get_text(name): ...

  #-------------- Passing arguments to decorators ------
  def tags(tag_name):
      def tags_decorator(func):
          def func_wrapper(name):
              return "<{0}>{1}</{0}>".format(tag_name, func(name))
          return func_wrapper
      return tags_decorator

  @tags("p")
  def get_text(name):
      return "Hello "+name

12.7. build-in

https://docs.python.org/3/library/functions.html

abs(x)
absolute value
all(iterable)
all elements of the iterable are true or empty = true
any(iterable)
any element is true or empty = false
ascii(object)
printable representation of an object
breakpoint(*args, **kws)
drops you into the debugger at the call site. calls sys.breakpointhook() which calls pdb.set_trace()
callable(object)
if the object - callable type - true. (classes are callable )
@classmethod
function decorator. May be called for class C.f() or for instance C().f() For derived class derived class object is passed as the implied first argument.
class C:
   @classmethod
   def f(cls, arg1, arg2, ...): ...
compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1)
into code or AST object - can be executed by exec() or eval(). Mode - 'exec' if source consists of a sequence of statements. 'eval' if it consists of a single expression
delattr(object, name)
like setattr() - delattr(x, 'foobar') is equivalent to del x.foobar.
divmod(a, b)
(a, b) - two (non-complex) numbers; returns the pair (quotient, remainder) from integer division
enumerate(iterable, start=0)
return an iterator which yields tuples (0, seq[0]), (1, seq[1]) ..
eval(expression, globals=None, locals=None)
string is parsed and evaluated as a Python expression . The globals() and locals() functions returns the current global and local dictionary, respectively, which may be useful to pass around for use by eval() or exec().
exec(object[, globals[, locals]])
object must be either a string or a code object. Be aware that the return and yield statements may not be used outside of function definitions even within the context of code passed to the exec() function. The return value is None.
filter(function, iterable)
Construct an iterator from those elements of iterable for which function returns true.
getattr(object, name[, default])
Return the value of the named attribute of object. name must be a string; AttributeError is raised if the attribute is missing and no default is given
setattr(object, name, value)
assigns the value to the attribute, provided the object allows it
globals()
dictionary representing the current global symbol table (inside a function or method, this is the module where it is defined, not the module from which it is called)x
hasattr(object, name)
result is True if the string is the name of one of the object’s attributes, False if not
hash(object)
Hash values are integers. Object __hash__() method.
id(object)
“identity” of an object - integer. Unique and constant during life time. Two objects with non-overlapping lifetimes may have the same id() value.
isinstance(object, classinfo)
True if object is an instance of the classinfo argument.
issubclass(class, classinfo)
true if class is a subclass of classinfo. class is considered a subclass of itself
iter(object[, sentinel])
1) Return an iterator object. __iter__() or __getitem__() 2) object must be a callable object __next__() if the value returned is equal to sentinel, StopIteration will be raised
next(iterator[, default])
__next__() If default is given, it is returned if the iterator is exhausted
len(s)
.
map(function, iterable, …)
Return an iterator that applies function to every item of iterable. With several iterables, function is applied to items from all of them in parallel.
max/min(iterable, *[, key, default])
.
max/min(arg1, arg2, *args[, key])
largest item in an iterable or the largest of two or more arguments
memoryview(obj)
“memory view” object
pow(x, y[, z])
(x** y) % z
repr(object)
__repr__() method - printable representation of an object
reversed(seq)
__reversed__() method or support sequence protocol (the __len__() method and the __getitem__()
round(number[, ndigits])
number rounded to ndigits precision after the decimal point
sorted(iterable, *, key=None, reverse=False)
sorted list [] from the items in iterable
@staticmethod
method into a static method.
sum(iterable[, start])
returns the total
super([type[, object-or-type]])
Return a proxy object that delegates method calls to a parent/parents or sibling class of type
vars([object])
__dict__ attribute for a module, class, instance, or any other object
zip(*iterables)
Make an iterator of tuples that aggregates elements from each of the iterables.
  • list(zip([1, 2, 3],[1, 2, 3])) = [(1, 1), (2, 2), (3, 3)]
  • unzip: list(zip(*zip([1, 2, 3],[1, 2, 3]))) = [(1, 2, 3), (1, 2, 3)]
__import__(name, globals=None, locals=None, fromlist=(), level=0)
not needed in everyday Python programming

class bool([x])
standard truth testing procedure see 6.4
class bytearray([source[, encoding[, errors]]])
- mutable. If source is a string, you must also give the encoding (it uses str.encode())
class bytes([source[, encoding[, errors]]])
-immutable
class complex([real[, imag]])
complex('1+2j'). - default - 0j
class dict(**kwarg)
dict(one=1, two=2, three=3) = {'one': 1, 'two': 2, 'three': 3}; dict([('two', 2), ('one', 1), ('three', 3)])
class dict(mapping, **kwarg)
build a dictionary from a mapping object’s (key, value) pairs
class dict(iterable, **kwarg)
dict(zip(['one', 'two', 'three'], [1, 2, 3]))
class float([x])
from a number or string x.
class frozenset([iterable])
see 6.3.
class int([x])
x.__int__() or x.__trunc__().
class int(x, base=10)
.
class list([iterable])
.
class object
Return a new featureless object.
class property(fget=None, fset=None, fdel=None, doc=None)
class range(stop)
class range(start, stop[, step])
immutable sequence type
class set([iterable])
.
class slice(stop)
.
class str(object='')
.
class str(object=b'', encoding='utf-8', errors='strict')
.
tuple([iterable])
.
class type(object)
object.__class__
class type(name, bases, dict)
.

input([prompt])
read a line from stdin and return it as a string (after printing prompt, if given).
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
Open file and return a corresponding file object.
print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
to file or sys.stdout
dir([object])
list of valid attributes for that object. or list of names in the current local scope. __dir__() - method called - dir() - Is supplied primarily as a convenience for use at an interactive prompt
help([object])
built-in help system
locals()
the current local symbol table

bin(x)
bin(3) -> '0b11'
chr(i)
Return the string representing a character = i - Unicode code
hex(x)
hex(255) = '0xff'
format(value[, format_spec])
https://docs.python.org/3/library/string.html#formatspec
oct(x)
Convert an integer number to an octal string prefixed with “0o”.
ord(c)
c - string representing one Unicode character. Return integer.

12.8. Closure

def compose_greet_func(name):
    def get_message():
        return "Hello there "+name+"!"

    return get_message

greet = compose_greet_func("John")
print(greet())
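Closures can also hold mutable state via nonlocal (a counter sketch):

```python
def make_counter():
    count = 0
    def inc():
        nonlocal count   # rebind the enclosing variable, not a local
        count += 1
        return count
    return inc

c = make_counter()
assert (c(), c(), c()) == (1, 2, 3)
c2 = make_counter()      # each closure carries its own state
assert c2() == 1
```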

12.9. overloading

from functools import singledispatch

@singledispatch
def func(arg1, arg2):
    print("default implementation of func - ", arg1, arg2)

@func.register
def func_impl_1(arg1: str, arg2):
    print("Implementation of func with first argument as string - ", arg1, arg2)

@func.register
def func_impl_2(arg1: int, arg2):
    print("Implementation of func with first argument as int - ", arg1, arg2)


func(1, "hello")
func("test", "hello")
func(1.34, "hi")

Implementation of func with first argument as int -  1 hello
Implementation of func with first argument as string -  test hello
default implementation of func -  1.34 hi

13. asterisk(*)

  1. For multiplication and power operations.
    • 2*3 = 6
    • 2**3 = 8
  2. For repeatedly extending the list-type containers.
    • (0,) * 100
  3. For using the variadic arguments. "Packaging" - def save_ranking(*args, **kwargs):
    • *args - tuple
    • **kwargs - dict
  4. For unpacking the containers (so-called “unpacking”) - to pass a list into variadic arguments
def product(*numbers):
product(*[2, 3, 5, 7, 11, 13])
  5. for function parameters: names after * are keyword-only; names before / are positional-only; between / and * - positional or keyword
def another_strange_function(a, b, /, c, *, d):
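Packing and unpacking together:

```python
def product(*numbers):   # packing: numbers arrives as a tuple
    result = 1
    for n in numbers:
        result *= n
    return result

assert product(*[2, 3, 5, 7]) == 210     # unpacking a list into arguments

first, *rest = [1, 2, 3, 4]              # unpacking in assignment
assert first == 1 and rest == [2, 3, 4]
```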

14. with

with ContextManager() as c1, ContextManager() as c2:

14.1. Context manager class TEMPLATE

class DatabaseConnection(object):
    def __enter__(self):
        # make a database connection and return it
        ...
        return self.dbconn

    def __exit__(self, exc_type, exc_val, exc_tb):
        # make sure the dbconnection gets closed
        self.dbconn.close()
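The same protocol can be checked with a self-contained class that records __enter__/__exit__ calls:

```python
events = []

class Managed:
    def __enter__(self):
        events.append("enter")
        return self            # bound to the name after 'as'
    def __exit__(self, exc_type, exc_val, exc_tb):
        events.append("exit")  # runs even if the body raises

with Managed() as m:
    events.append("body")

assert events == ["enter", "body", "exit"]
```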

15. Operators and control structures

Ternary operation: a if condition else b

15.1. basic

Arithmetic

  • + - *
  • / - 9/2 = 4.5 - Division
  • % - 9%2 = 1 - Modulus - returns remainder
  • ** - Exponent
  • // - Floor Division: 9//2 = 4, -9//2 = -5
  • += -= *= /= %= **= //=

Comparison == != > < >= <= (<> was removed in Python 3)

Bitwise

  • &
  • |
  • ^ - XOR
  • ~ - ~a = 1100 0011 (for a = 0011 1100)
  • << - a<<2 = 1111 0000
  • >>

Logical - AND - OR - NOT

Membership - in, not in

Identity Operators ( point to the same object) - is, is not

15.2. Operator Precedence [ˈpresədəns]

https://docs.python.org/3/reference/expressions.html#operator-precedence

  1. Binding or parenthesized expression, list display, dictionary display, set display
    • (expressions…),
    • [expressions…], {key: value…}, {expressions…}
  2. Subscription, slicing, call, attribute reference
    • x[index], x[index:index], x(arguments…), x.attribute
  3. await x - Await expression
  4. ** - Exponentiation [5]
  5. +x, -x, ~x - Positive, negative, bitwise NOT
  6. *, @, /, //, % - Multiplication, matrix multiplication, division, floor division, remainder [6]
  7. +, - - Addition and subtraction
  8. <<, >> - Shifts
  9. & - Bitwise AND
  10. ^ - Bitwise XOR
  11. | - Bitwise OR
  12. in, not in, is, is not, <, <=, >, >=, !=, == - Comparisons, including membership tests and identity tests
  13. not x - Boolean NOT
  14. and - Boolean AND
  15. or - Boolean OR
  16. if – else - Conditional expression
  17. lambda - Lambda expression
  18. := - Assignment expression

old:

  1. **
  2. ~ + - unary
  3. * / % //
  4. + -
  5. >> <<
  6. &
  7. ^ |
  8. <= < > >=
  9. == != Equality operators
  10. = %= /= //= -= += *= **= Assignment operators
  11. is is not
  12. in not in
  13. not or and - Logical operators

15.3. value unpacking

x=("v1", "v2")
a,b = x
print(a, b)
# v1 v2

T=(1,)
b,=T
# b= 1

15.4. if, loops

if expression1:
    statement(s)
elif expression2:
    statement(s)

while expression:
   statement(s)

while count < 5:
   print(count, " is less than 5")
   count = count + 1
else:  # when the condition becomes false or at the end
   print(count, " is not less than 5")

for iterating_var in sequence:
   statements(s)
else: # when no break encountered
   print(num, 'is a prime number')


break # Terminates the loop
continue # skip the remainder
pass # null operation - just stupid empty operator - nothing else.

# Compact loops, double loop
[print(x,y) for x in range(1000) for y in range(x, 1000)]
[g for g in [x['whole_word_timestamps'] for x in whisper_stable_result]] # inner list created on every loop

for item in array: array2.append (item)

15.5. match 3.10

command = input("What are you doing next? ")

match command.split():
    case ["quit"]:             # must come before the catch-all one-word case
        print("Goodbye!")
        quit_game()
    case [action]:
        ... # interpret single-verb action
    case [action, obj]:
        ... # interpret action, obj

15.6. Slicing Sequence

  • a[i:j] - i to j
  • s[i:j:k] - slice i to j with step k;

s = list(range(10)) - [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

  • s[-2] - = 8
  • s[1:] - [1, 2, 3, 4, 5, 6, 7, 8, 9]
  • s[1::] - [1, 2, 3, 4, 5, 6, 7, 8, 9]
  • s[:2] - [0, 1]
  • s[:-2] - [0, 1, 2, 3, 4, 5, 6, 7]
  • s[-2:] - [8, 9]
  • s[::2] - [0, 2, 4, 6, 8]
  • s[::-1] -[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

16. Traverse or iteration over containers

16.1. iterator object

Behind the scenes, the for statement calls iter() to obtain an iterator object

  • __next__() - when nothing is left - raises a StopIteration exception.
#remove in loop: https://docs.python.org/3/reference/compound_stmts.html#the-for-statement
for f in ret[:]:
  ret.remove(f)

for element in [1, 2, 3]:
    print(element)
for element in (1, 2, 3):
    print(element)
for key in {'one':1, 'two':2}:
    print(key)
for char in "123":
    print(char)
for line in open("myfile.txt"):
    print(line, end='')


class Reverse: # add iterator behavior to your classes
    """Iterator for looping over a sequence backwards."""
    def __init__(self, data):
        self.data = data
        self.index = len(data)

    def __iter__(self):
        return self

    def __next__(self):
        if self.index == 0:
            raise StopIteration
        self.index = self.index - 1
        return self.data[self.index]

rev = Reverse('spam')
for char in rev:
    print(char)

#compact form
>>> t = {x: x*x for x in range(0, 4)}
>>> print(t)
{0: 0, 1: 1, 2: 4, 3: 9}

16.2. iterate dictionary

  • for key in a_dict:
  • for item in a_dict.items(): - tuple
  • for key, value in a_dict.items():
  • for key in a_dict.keys():
  • for value in a_dict.values():

Since Python 3.7 (an implementation detail already in CPython 3.6), dictionaries are ordered data structures, so you can sort the items of any dictionary by using sorted() and with the help of a dictionary comprehension:

  • sorted_income = {k: incomes[k] for k in sorted(incomes)}
  • sorted() - sort keys
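A quick check of the comprehension (the incomes mapping is arbitrary):

```python
incomes = {'banana': 2, 'apple': 1, 'cherry': 3}
sorted_income = {k: incomes[k] for k in sorted(incomes)}  # sorted() sorts keys
assert list(sorted_income) == ['apple', 'banana', 'cherry']
assert sorted_income['apple'] == 1
```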

17. The Language Reference

17.1. yield and generator expression

form of coroutine

  • (expression comp_for) - (x*y for x in range(10) for y in range(x, x+10)) = <generator object>

yield is used to create a generator - a lazily produced sequence for looping over.

  • valid only inside a function body
  • like return, but execution pauses after the yield and resumes at the same point on the next iteration
  • async def with yield - asynchronous generator - not iterable with a plain for - <async_generator object> (coroutine objects)
  • async generators do not implement __iter__ and __next__ (they implement __aiter__ and __anext__)

17.2. yield from

allows a generator to delegate part of its operations to another generator:

def gen_list1(iterable):
    for i in list(iterable):
        yield i

# equal to:
def gen_list2(iterable):
    yield from list(iterable)

17.3. ex

def agen():
    for n in range(1, 10):
          yield n

list(agen())  # [1, 2, 3, 4, 5, 6, 7, 8, 9]


def a():
    for n in range(1, 3):
          yield n
def agen():
    for n in range(1, 7):
          yield from a()

list(agen())  # [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]

#-------------------------
async def ticker(delay, to):
    """Yield numbers from 0 to *to* every *delay* seconds."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)

17.4. function decorator

#+name example_1

    def hello(func):
        def inner():
            print("Hello ")
            func()
        return inner

@hello
def name():
    print("Alice")

#+name example_2

    def star(n):
        def decorate(fn):
            def wrapper(*args, **kwargs):
                print(n*'*')
                result = fn(*args, **kwargs)
                print(result)
                print(n*'*')
                return result
            return wrapper
        return decorate


@star(5)
def add(a, b):
    return a + b


add(10, 20)

17.5. class decorator

  • print(f.__name__) - shows the wrapper's name unless functools.wraps is used
  • print(f.__doc__) - shows the wrapper's docstring unless functools.wraps is used

    #+name ex1

         from functools import wraps
    
         class Star:
             def __init__(self, n):
                 self.n = n
    
             def __call__(self, fn):
                 @wraps(fn) # addition to fix f.__name__ and __doc__
                 def wrapper(*args, **kwargs):
                     print(self.n*'*')
                     result = fn(*args, **kwargs)
                     print(result)
                     print(self.n*'*')
                     return result
                 return wrapper
    
         @Star(5)
         def add(a, b):
             return a + b
    
         # or, equivalently (without the @ syntax):
         add = Star(5)(add)
    
         add(10, 20)
    
    
    

17.6. lines

new line

  • End of line - Unix LF, Windows CR LF, Macintosh CR - all of these forms can be used equally, regardless of platform
  • In Python - C conventions for newline characters - \n - ASCII LF

Comments

# - line
""" comment """ - multiline

Explicit line joining (backslash) - a line ending in a backslash cannot carry a comment

if 1900 < year < 2100 and 1 <= month <= 12 \
  and 1 <= day <= 31 and 0 <= hour < 24 # Looks like a valid date

Implicit line joining

month_names = ['Januari', 'Februari', 'Maart',      #you can
               'Oktober', 'November', 'December']   #do it

Blank line - contains only spaces, tabs, formfeeds (FF, \f) and possibly a comment

17.7. Indentation

  • Leading whitespace (spaces and tabs)
  • determine the grouping of statements
  • TabError - if a source file mixes tabs and spaces in a way that makes the meaning dependent on the width of a tab in spaces

Tabs are replaced (left to right) by one to eight spaces, up to the next multiple of eight columns.

17.8. identifier [aɪˈdentɪfaɪər] or names

[A-Za-z_] for the first character, then [A-Za-z0-9_] - case sensitive

Reserved classes of identifiers

  • _* - not imported by from M import *
  • __*__ - system-defined ("dunder") names
  • __* - class-private names (name mangling)

17.9. Keywords Exactly as written here:

False await else import pass
None break except in raise
True class finally is return
and continue for lambda try
as def from nonlocal while
assert del global not with
async elif if or yield

17.10. Numeric literals

  • integers
  • floating point numbers - 3.14 10. .001 1e100 3.14e-10 0e0 3.14_15_93
  • imaginary numbers - 3.14j 10.j 10j .001j 1e100j 3.14e-10j 3.14_15_93j

-1 - expression composed of the unary operator ‘-‘ and the literal 1

17.10.1. integers

  • integer ::= decinteger | bininteger | octinteger | hexinteger
  • decinteger ::= nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
  • bininteger ::= "0" ("b" | "B") (["_"] bindigit)+
  • octinteger ::= "0" ("o" | "O") (["_"] octdigit)+
  • hexinteger ::= "0" ("x" | "X") (["_"] hexdigit)+
  • nonzerodigit ::= "1"..."9"
  • digit ::= "0"..."9"
  • bindigit ::= "0" | "1"
  • octdigit ::= "0"..."7"
  • hexdigit ::= digit | "a"..."f" | "A"..."F"

17.10.2. float

  • floatnumber ::= pointfloat | exponentfloat
  • pointfloat ::= [digitpart] fraction | digitpart "."
  • exponentfloat ::= (digitpart | pointfloat) exponent
  • digitpart ::= digit (["_"] digit)*
  • fraction ::= "." digitpart
  • exponent ::= ("e" | "E") ["+" | "-"] digitpart

3.14 10. .001 1e100 3.14e-10 0e0 3.14_15_93

17.10.3. Imaginary literals

imagnumber ::= (floatnumber | digitpart) ("j" | "J")

3.14j 10.j 10j .001j 1e100j 3.14e-10j 3.14_15_93j

17.11. Docstring and comments

first thing in a class/function/module

''' This is a multiline comment. '''

17.12. Simple statements

  • assert
  • pass
  • del
  • return
  • yield
  • raise - without argument - re-raise the exception in try except
  • break
  • continue
  • import
  • global identifier - tells the parser to treat the identifier as global; used when a function assigns to module-level variables
  • nonlocal identifier - for nested functions; rebinds a variable in the enclosing function's scope (neither global nor local)
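A minimal sketch of both statements (names are illustrative):

```python
counter = 0

def bump():
    global counter      # rebind the module-level name
    counter += 1

def make_accumulator():
    total = 0
    def add(x):
        nonlocal total  # rebind the enclosing function's local
        total += x
        return total
    return add

bump()
acc = make_accumulator()
acc(10)
print(counter, acc(5))  # 1 15
```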

17.13. open external

with shell=True, pass the command as a single string rather than a list of arguments

17.13.1. ex

# -- 1
import os
os.system("echo Hello World")
# cannot capture output or pass input
# -- 2
import os
pipe = os.popen("dir *.md")
print(pipe.read())

# -- 3
import subprocess
subprocess.Popen("echo Hello World", shell=True, stdout=subprocess.PIPE).stdout.read()

# -- 4 old
import subprocess
subprocess.call("echo Hello World", shell=True)

# -- 5
import subprocess
print(subprocess.run("echo Hello World", shell=True))

# -- 6
import subprocess
ls_command = "ls"  # any shell command string
(ls_status, ls_output) = subprocess.getstatusoutput(ls_command)

# -- 7
import subprocess
cmd = ["date"]  # command and its arguments
# returns output as byte string
returned_output = subprocess.check_output(cmd)
# decode() converts the byte string to str
print('Current date is:', returned_output.decode("utf-8"))

# -- 8 with timeout
import subprocess
DELAY = 10
po = subprocess.Popen(["sleep 1; echo 'asd\nasd'"], shell=True, stdout=subprocess.PIPE)
po.wait(DELAY)
print(po.stdout.read().decode('utf-8'))
print("ok")

18. The Python Standard Library

18.1. Major libs:

  • os - portable way of using operating system dependent functionality - files, Command Line Arguments, Environment Variables
    • shutil - higher level interface for files
    • glob - file lists from directory
  • logging
  • threading - multi-threading
  • collections - !!!
  • re - regular expression
  • math
  • statistics
  • datetime
  • zlib, gzip, bz2, lzma, zipfile and tarfile.
  • timeit - performance test
  • profile and pstats - tools for identifying time critical sections in larger blocks of code
  • doctest - module provides a tool for scanning a module and validating tests embedded in a program’s docstrings.
  • unittest
  • json
  • sqlite3
  • Internationalization supported by: gettext, locale, and the codecs package

18.2. regex - import re

import re

match
matches only at the beginning of the string; returns a Match object
fullmatch
the whole string must match
search
finds the first occurrence anywhere in the string
compile(pattern)
"compiles" a regex given as a string into a pattern object for later reuse
sub
replace substring

Flags:

  • re.DOTALL - by default '.' matches any character except newline; with re.DOTALL it matches newline too
  • re.IGNORECASE
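The differences above can be checked directly (the sample string is made up):

```python
import re

s = "spam: eggs\nham"

assert re.match(r"spam", s)                       # only at the start
assert re.match(r"eggs", s) is None
assert re.fullmatch(r"spam.*", "spam and eggs")   # whole string must match
assert re.search(r"eggs", s).span() == (6, 10)    # first occurrence anywhere

# '.' does not match a newline unless re.DOTALL is set
assert re.search(r"eggs.ham", s) is None
assert re.search(r"eggs.ham", s, re.DOTALL) is not None
```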

18.2.1. example

import re

regex = re.compile('[^а-яА-ЯёЁ/-//,. ]')   # anything except Cyrillic letters and listed punctuation
reg_pu = re.compile('[,]')
reg_pu2 = re.compile(r'\.([а-яА-ЯёЁ])')    # '.a' -> '. a'

s = reg_pu.sub(' ', data['naznach'])       # data['naznach'] - the source string
s = reg_pu2.sub(r'. \1', s)
nf = regex.sub(' ', s).lower().split()

# -----------------
import re

s = 'asdds https://alalal.com'
m = re.search('https.*', s)
if m:
  sp = m.span()
  sub = s[sp[0]:sp[1]]   # the matched URL


18.2.2. get string between substring

res = re.search("123(.*)789", "123456789")
res.group(1)  # '456'

18.3. datetime

18.3.1. datetime to date

d.date()

18.3.2. date to datetime
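
This subsection was empty; one common way is datetime.combine with midnight:

```python
import datetime

d = datetime.date(2022, 5, 1)

# midnight of that day
dt = datetime.datetime.combine(d, datetime.time.min)
print(dt)  # 2022-05-01 00:00:00
```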

18.3.3. current time

datetime.datetime.now()

  • use .time() or .date() to extract the time or date component

18.4. file object

https://docs.python.org/3/library/filesys.html

  • os - lower level than Python "file objects"
  • os.path — Common pathname manipulations
  • shutil — High-level file operations
  • tempfile — Generate temporary files and directories
  • Built-in function open() - returns "file object"


18.5. importlib

import importlib
itertools = importlib.import_module('itertools')

g = importlib.import_module('t')  # assuming a local module t.py that defines v
g.v
# from g import v  # ERROR: 'g' is a variable, import needs a module name

19. exceptions handling

  • syntax errors - repeats the offending line and displays a little ‘arrow’ pointing
  • exceptions
    • last line indicates what happened: stack traceback and ExceptionType: detail based on the type and what caused it
    • an exception may carry arguments, available as its .args attribute

Words: try, except, else, finally, raise, with

  • BaseException - root exception
  • Exception - non-system-exiting exceptions are derived from this class
  • Warning - warnings.warn("Warning message")

19.1. explanation

try:
    foo = open("foo.txt")
except IOError:
    print("error")
else: # if no exception in try block
    print(foo.read())
finally: # always
    print("finished")

19.2. traceback

two ways

import traceback
import sys

try:
    do_stuff()
except Exception:
    print(traceback.format_exc())
    # or
    print(sys.exc_info()[0])

19.3. examples

  while True:
      try:
          x = int(input("Please enter a number: "))
          break
      except ValueError:
          print("Oops!  That was no valid number.  Try again...")
      except (RuntimeError, TypeError, NameError):
          pass
      except OSError as err:
          print("OS error: {0}".format(err))
          print("Unexpected error:", sys.exc_info()[0])
      except:  # catches anything - use with extreme caution!
          print("B")
          raise          # re-raise the exception



  try:
      raise Exception('spam', 'eggs')
  except Exception as inst:
      print(type(inst))    # the exception instance
      print(inst.args)     # arguments stored in .args
      print(inst)          # __str__ allows args to be printed directly



  try:
      result = x / y
  except ZeroDivisionError:
      print("division by zero!")
  else:                           # no exception
      print("result is", result)
  finally:                        # always runs, even with an unexpected exception
      print("executing finally clause")


  with open("myfile.txt") as f: # f is always closed, even if a problem was encountered
      for line in f:
          print(line, end="")


        try:
            obj = self.method_number_list[method_number](image)
            self.OUTPUT_OBJ = obj.OUTPUT_OBJ
        except Exception as e:
            if hasattr(e, 'message'):
                self.OUTPUT_OBJ = {"qc": 3, "exception": e.message}
            else:
                self.OUTPUT_OBJ = {"qc": 3, "exception": str(type(e).__name__) + " : " + str(e.args)}

20. Logging

20.1. ways to log

  1. loggers: logger = logging.getLogger(name) ; logger.warning("as")
  2. root logger: logging.warning('Watch out!')
logging.basicConfig(level=logging.NOTSET)
root_logger = logging.getLogger()

or

logger = logging.getLogger(__name__)
logger.setLevel(logging.NOTSET)

20.2. terms

handlers
send the log records (created by loggers) to the appropriate destination.
records
single log events; created by loggers, handed to handlers
loggers
expose the interface that application code directly uses.
Filters
provide a finer grained facility for determining which log records to output.
Formatters
specify the layout of log records in the final output.

20.3. getLogger()

Multiple calls to getLogger(name) with the same name will always return a reference to the same Logger object.

name - period-separated hierarchical value, like foo.bar.baz
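Both properties can be checked directly (the names "foo", "foo.bar.baz" are arbitrary):

```python
import logging

parent = logging.getLogger("foo")
child = logging.getLogger("foo.bar.baz")

assert logging.getLogger("foo") is parent  # same name -> same object
assert child.parent is parent              # nearest existing ancestor logger in the dotted hierarchy
```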

20.4. stderr

default:

  • output goes to stderr
  • level = WARNING

20.5. inspection

get all loggers:

[print(name) for name in logging.root.manager.loggerDict]

logger properties:

  • logger.level
  • logger.handlers
  • logger.filters
  • logger.root.handlers[0].formatter._fmt - formatter
  • logger.root.handlers[0].formatter.default_time_format

root logger: logging.root or logging.getLogger()

20.6. levels

  • CRITICAL 50
  • ERROR 40
  • WARNING 30
  • INFO 20
  • DEBUG 10
  • NOTSET 0

21. Collections

21.1. collections.Counter() - dict subclass for counting hashable objects

from collections import Counter
cnt = Counter()
for word in ['red', 'blue', 'red']:
    cnt[word] += 1
cnt.most_common(1)  # [('red', 2)]

most_common(n) returns a list of the n most common elements and their counts, from the most common to the least.

21.2. time complexity

O - provides an upper bound on the growth rate of the function.

x in c (containment):

  • list - O(n)
  • dict - O(1) average, O(n) worst case
  • set - O(1) average, O(n) worst case

set item:

  • list - O(1)
  • collections.deque - O(1) append
  • dict - O(1) average, O(n) worst case

get item:

  • list - O(1)
  • collections.deque - O(1) pop
  • dict - O(1) average, O(n) worst case

https://wiki.python.org/moin/TimeComplexity

22. Conventions

22.1. code style, indentation, naming

Indentation:

  • 4 spaces per indentation level.
  • Spaces are the preferred indentation method.

Limit all lines to a maximum of 79 characters.

Surround top-level function and class definitions with two blank lines.

Method definitions inside a class are surrounded by a single blank line.

Inside class:

  • capitalizing method names
  • prefixing data attribute names with a small unique string (perhaps just an underscore)
  • using verbs for methods and nouns for data attributes.

naming conventions

  • https://www.python.org/dev/peps/pep-0008/
  • Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability.
  • Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.
  • Class Names - CapWords convention
  • function names - lowercase with words separated by underscores as necessary to improve readability

22.2. 1/2 underscore

Single Underscore: PEP-0008: _single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose name starts with an underscore.

Double Underscore: https://docs.python.org/3/tutorial/classes.html#private-variables

  • Any identifier of the form __spam (at least two leading underscores, at most one trailing underscore) is textually replaced with _classname__spam, where classname is the current class name with leading underscore(s) stripped. This mangling is done without regard to the syntactic position of the identifier, so it can be used to define class-private instance and class variables, methods, variables stored in globals, and even variables stored in instances private to this class on instances of other classes.
  • Name mangling is intended to give classes an easy way to define “private” instance variables and methods, without having to worry about instance variables defined by derived classes, or mucking with instance variables by code outside the class. Note that the mangling rules are designed mostly to avoid accidents; it still is possible for a determined soul to access or modify a variable that is considered private. ( as a way to ensure that the name will not overlap with a similar name in another class.)
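The mangling described above is easy to observe (class name `A` and attribute `__secret` are illustrative):

```python
class A:
    def __init__(self):
        self.__secret = 1   # textually replaced with _A__secret

a = A()
assert not hasattr(a, '__secret')  # mangling happens at compile time, in class bodies
assert a._A__secret == 1           # the mangled name is still accessible
```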

22.3. Whitespace in Expressions and Statements

Yes: spam(ham[1], {eggs: 2})
No:  spam ( ham [ 1 ], { eggs: 2 } )
Yes: if x == 4: print(x, y); x, y = y, x
No:  if x == 4 : print(x , y) ; x , y = y , x

YES:
i = i + 1
submitted += 1
x = x*2 - 1
hypot2 = x*x + y*y
c = (a+b) * (a-b)

def munge(input: AnyStr): ...
def munge() -> AnyStr: ...

def complex(real, imag=0.0):
    return magic(r=real, i=imag)


if foo == 'blah':
    do_blah_thing()
do_one()
do_two()
do_three()

FILES = [
    'setup.cfg',
    'tox.ini',
    ]
initialize(FILES,
           error=True,
           )

No:
FILES = ['setup.cfg', 'tox.ini',]
initialize(FILES, error=True,)

22.4. naming

case sensitive

  • Class names start with an uppercase letter. All other identifiers start with a lowercase letter.
  • Starting an identifier with a single leading underscore indicates that the identifier is private = _i
  • two leading underscores indicates a strongly private identifier = __i
  • Never use the characters 'l' (lowercase letter el), 'O' (uppercase letter oh), or 'I' (uppercase letter eye) as single character variable names.

Package and Module Names - all-lowercase names. Underscores are discouraged. A C/C++ extension module has a leading underscore (e.g. _socket). https://peps.python.org/pep-0423/

Class Names - CapWords, or CamelCase

functions and variables Function and variable names should be lowercase, with words separated by underscores as necessary to improve readability.

  • Always use self for the first argument to instance methods.
  • Always use cls for the first argument to class methods.

Constants MAX_OVERFLOW

22.5. docstrings

A docstring is the first thing in a module, function, class, or method definition (stored in the __doc__ special attribute).

Convs.:

  • Phrase ending in a period.
  • (""" """) are used even though the string fits on one line.
  • The closing quotes are on the same line as the opening quotes
  • There’s no blank line either before or after the docstring.
  • It prescribes the function or method’s effect as a command (“Do this”, “Return that”), not as a description; e.g. don’t write “Returns the pathname …”.
  • Multiline: 1. summary 2. blank 3. more elaborate description

22.5.1. ex. simple

def kos_root():
    """Return the pathname of the KOS root directory."""

def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero

23. Concurrency

https://docs.python.org/3/library/concurrency.html Notes:

  • Preferred approach is to concentrate all access to a resource in a single thread and then use the queue module to feed that thread with requests from other threads.

coroutine (сопрограмма) - a component that allows execution to be suspended and resumed; its state is saved

23.1. select right API

problems:

  • CPU-Bound Program
  • I/O-bound problem - spends most of its time waiting for external operations

types:

  • multiprocessing - creating a new instance of the Python interpreter to run on each CPU and then farming out part of your program to run on it.
  • threading - Pre-emptive multitasking, The operating system decides when to switch tasks.
    • hard to code, race conditions
  • one thread
  • Coroutines - Cooperative multitasking - The tasks decide when to give up control.
    • asyncio

modules:

  • threading - Thread-based parallelism - fast - better for I/O-bound applications due to the Global Interpreter Lock
  • multiprocessing — Process-based parallelism - slow - better for CPU-bound applications
  • concurrent.futures - high-level interface for asynchronously executing callables ThreadPoolExecutor or ProcessPoolExecutor.
  • subprocess - it’s the recommended option when you need to run multiple processes in parallel or call an external program or external command from inside your Python code. spawn new processes, connect to their input/output/error pipes, and obtain their return codes
  • sched - event scheduler
  • queue - useful in threaded programming when information must be exchanged safely between multiple thread
  • asyncio - coroutine-based concurrency(Cooperative multitasking) The tasks decide when to give up control.

Python-Concurrency-API-Decision-Tree.jpg

Python-Concurrency-API-Pools-vs-Executors.png

Python-Concurrency-API-Worker-Pool-vs-Class.png

23.2. Process

from multiprocessing import Process
# daemon processes are not allowed to create child processes
proc: Process = Process(target=self.perform_job, args=(job, queue), daemon=False)
proc.start()
proc.join(WAIT_FOR_THREAD)  # seconds
if proc.is_alive():
  pass

from multiprocessing.pool import Pool
def callback_result(result):
   print(result)
# Pool
executor = Pool(processes=PAGE_THREADS)  # PAGE_THREADS defined elsewhere; worker death frees leaked memory
for i, fp in enumerate(filelist):
    executor.apply_async(
        page_processing, args=(i, fp, self.id_processing, self.doc_classes, self.barcodes_only),
        callback=callback_result)
executor.close()
executor.join()

23.3. threading

Daemon - daemon thread will shut down immediately when the program exits. default=False

Python (CPython) is not optimized for heavy threading. You can keep allocating more resources and it will keep spawning/queuing new threads and overloading the cores. You need to make a design change here:

Process based design:

  • Either use the multiprocessing module
  • Make use of rabbitmq and make this task run separately
  • Spawn a subprocess

Or if you still want to stick to threads:

  • Switch to PyPy (faster compared to CPython)
  • Switch to PyPy-STM (totally does away with GIL)

23.3.1. examples

  1. ThreadPoolExecutor - many function for several workers
    def get_degree1(angle):
        return angle  # placeholder computation
    
    def get_degree2(angle):
        return angle  # placeholder computation
    
    import concurrent.futures
    x = 45  # example input
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future1 = executor.submit(get_degree1, x) # started
        future2 = executor.submit(get_degree2, x) # started
        data1 = future1.result()
        data2 = future2.result()
    
    
  2. ThreadPoolExecutor - one function for several workers
    def get_degree(angle):
       return angle  # placeholder computation
    
    import concurrent.futures
    angles: list = []
    degrees = range(10)  # example inputs
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(get_degree, x): x for x in degrees}
        for future in concurrent.futures.as_completed(futures):
            # futures[future] # degree
            data = future.result()
            angles.append(data)
    
  3. Custom thread
    from threading import Thread
    
    def foo(bar):
        print('hello {0}'.format(bar))
        return "foo"
    
    class ThreadWithReturnValue(Thread):
        def __init__(self, group=None, target=None, name=None,
                     args=(), kwargs={}, Verbose=None):
            Thread.__init__(self, group, target, name, args, kwargs)
            self._return = None
        def run(self):
            print(type(self._target))
            if self._target is not None:
                self._return = self._target(*self._args,
                                                    **self._kwargs)
        def join(self, *args):
            Thread.join(self, *args)
            return self._return
    
    twrv = ThreadWithReturnValue(target=foo, args=('world!',))
    
    twrv.start()
    print(twrv.join())   # prints foo
    

23.3.2. synchronization

with - acquire() and release()

  • Lock, RLock, Condition, Semaphore, and BoundedSemaphore
  1. Lock and RLock (reentrant version)

    threading.Lock

  2. Condition object - barrier
    • cv = threading.Condition()
    • cv.wait() - block until notified
    • cv.notify_all() - resume all threads waiting in wait()
  3. Semaphore Objects - protected section

    maxconnections = 5
    pool_sema = BoundedSemaphore(value=maxconnections)

    with pool_sema:
        conn = connectdb()

  4. Barrier Objects - by number

    b = Barrier(2, timeout=5) # 2 - number of parties

    b.wait()

    b.wait()
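A minimal sketch of the `with` form of a Lock (the shared counter is a made-up example):

```python
import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100_000):
        with lock:              # with = acquire() / release()
            counter += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000
```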

23.4. multiprocessing

def get_degree(angle, angles):
    angles.append(angle)  # placeholder computation, result into the shared list

from multiprocessing import Process, Manager
manager = Manager()
angles = manager.list()  # shared result list
pool = []
degrees = range(10)  # example inputs
for x in degrees:
    p = Process(target=get_degree, args=(x, angles))
    pool.append(p)
    p.start()
for p2 in pool:
    p2.join()

  manager = mp.Manager()
  return_dict = manager.dict()
  jobs = []
  for i in range(len(fileslist)):
      p = mp.Process(target=PageProcessing, args=(i, return_dict, fileslist[i],))
      jobs.append(p)
      p.start()

  for proc in jobs:
      proc.join() # wait for each one to finish

23.5. asyncio

IO-bound and high-level structured network code; used to write and synchronize concurrent code.

Any function that calls await needs to be marked with async.

async as a flag to Python telling it that the function about to be defined uses await.

async with statement, which creates a context manager from an object you would normally await.

cons:

  • all of the advantages of cooperative multitasking get thrown away if one of the tasks doesn’t cooperate.

asyncio.run - ideally only be called once

23.5.1. Core terms:

  • Event Loop - low level the core of every asyncio application, high level: asyncio.run()
  • Coroutines - (async def statement or generator iterator)
  • awaitable object - used for await

23.6. asynchronous programming (asyncio, async, await)

23.6.1. run - simple

just creates a new event loop and executes one task in it; internally it is roughly:

with Runner(debug=debug) as runner:
    return runner.run(main)
import time
start_time = time.time()

import asyncio
async def main():
    await asyncio.sleep(2)
    print('hello')
    return 2

print (asyncio.run(main()))
print("--- %s seconds ---" % (time.time() - start_time))
print (asyncio.run(main()))
print("--- %s seconds ---" % (time.time() - start_time))
hello
2
--- 2.0038740634918213 seconds ---
hello
2
--- 4.007648944854736 seconds ---

23.6.2. run - await

import time
import asyncio

start_time = time.time()

async def main():
    print('enter')
    await asyncio.sleep(2)
    print("--- %s seconds ---" % (time.time() - start_time))
    print('hello')
    return 2

asyncio.run(main())
print("--- %s seconds ---" % (time.time() - start_time))

enter
--- 2.0... seconds ---
hello
--- 2.0... seconds ---

23.6.3. Runner

creates an event loop and a contextvars.Context that are reused across multiple runner.run() calls

import time
start_time = time.time()

import asyncio
async def main():
    await asyncio.sleep(2)
    print('hello')
    return 2

with asyncio.Runner() as runner:
    print (runner.run(main()))
    print("--- %s seconds ---" % (time.time() - start_time))
    print (runner.run(main()))
    print("--- %s seconds ---" % (time.time() - start_time))
hello
2
--- 2.003290891647339 seconds ---
hello
2
--- 4.006376266479492 seconds ---

23.7. example multiprocess, Threads, othe thread

    def main_processing(filelist) -> list:
        """ Multithread page processing

        :param filelist: PNG files of the pages of the incoming PDF file
        :return: {procnum:(procnum, new_obj.OUTPUT_OBJ), ....}
        """

        # import multiprocessing as mp
        # manager = mp.Manager()
        # return_dict = manager.dict()
        # jobs = []

        # for i in range(len(filelist)):
        #     p = mp.Process(target=page_processing, args=(i, return_dict, filelist[i]))
        #     jobs.append(p)
        #     p.start()
        #
        # for proc in jobs:
        #     proc.join()

        # Threads
        import concurrent.futures
        return_dict: list = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
            futures = {executor.submit(page_processing, i, x): x for i, x in enumerate(filelist)}
            for future in concurrent.futures.as_completed(futures):
                data = future.result()
                return_dict.append(data)

        # One Thread Debug
        # from threading import Thread
        # thread: Thread = Thread(target=page_processing, args=(0, filelist[0]))
        # thread.start()
        # thread.join()

        return list(return_dict)

24. Monkey patch (modification at runtime)

  • instance.attribute = 23

24.1. replace method of class instance

# -- on the class itself (affects all instances)
A.f = my_f

# -- on a single existing instance
import types

def func_my(self):
    pass


border_collie = Dog()
border_collie.herd  = types.MethodType(func_my, border_collie)

24.2. inspect.getmembers() vs dict.items() vs dir()

  • dir() and inspect.getmembers() are basically the same, but getmembers() returns (name, value) pairs and accepts a predicate filter
  • __dict__ is the complete namespace including metaclass attributes.

24.3. ex replace function

import werkzeug.serving
import functools

def wrap_function(oldfunction, newfunction):
    @functools.wraps(oldfunction)
    def run(*args): #, **kwargs
        return newfunction(oldfunction, *args) #, **kwargs
    return run

def generate_adhoc_ssl_pair2(oldfunc, parameter=None):
    # Do some processing or something to customize the parameters to pass
    c, k = oldfunc(parameter)
    print(c, c.public_key().public_numbers())
    return c,k


werkzeug.serving.generate_adhoc_ssl_pair = wrap_function(
        werkzeug.serving.generate_adhoc_ssl_pair, generate_adhoc_ssl_pair2)

24.4. ex replace method of class

import werkzeug.serving

oldfunc = werkzeug.serving.BaseWSGIServer.__init__

def myinit(*args, **kwargs):
    # Do some processing or something to customize the parameters to pass
    oldfunc(*args, **kwargs)
    print(dir(args[0].ssl_context))

werkzeug.serving.BaseWSGIServer.__init__ = myinit

25. Performance Tips

25.1. string

  • Avoid:
    • out = "<html>" + head + prologue + query + tail + "</html>"
  • Instead, use
    • out = "<html>%s%s%s%s</html>" % (head, prologue, query, tail)

25.2. loop

  • map(function, list)
  • iterator = (s.upper() for s in oldlist)

25.4. avoid global variables

25.5. dict

wdict = {}
for word in words:
    if word not in wdict:
        wdict[word] = 0
    wdict[word] += 1

# Use:

wdict = {}
for word in words:
    try:
        wdict[word] += 1
    except KeyError:
        wdict[word] = 1

# or:
wdict = {}
get = wdict.get
for word in words:
    wdict[word] = get(word, 0) + 1

# or:
wdict.setdefault(key, []).append(new_element)

# or:
from collections import defaultdict

wdict = defaultdict(int)
for word in words:
    wdict[word] += 1

26. decorators

  • @property - function becomes a read-only attribute (getter)
  • @staticmethod - makes a static method; does not use self
  • @classmethod - receives the class object as the first parameter instead of an instance. May be called on the class C.f() or on an instance C().f() / self.f(). Often used for alternative constructors and singletons.
Class Method vs Static Method:

  • the first parameter cls is required - not needed for a static method
  • can access or modify class state - a static method cannot
  • bound to the class and knows about it - a static method does not know about the class
  • mutable via inheritance - immutable via inheritance
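A minimal sketch of both decorators (the `Pizza` class and its names are made up):

```python
class Pizza:
    default_size = 30

    def __init__(self, size):
        self.size = size

    @classmethod
    def standard(cls):           # receives the class; also works for subclasses
        return cls(cls.default_size)

    @staticmethod
    def is_valid_size(size):     # no self/cls - a plain function in the class namespace
        return 10 <= size <= 50

p = Pizza.standard()
print(p.size, Pizza.is_valid_size(p.size))  # 30 True
```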

26.1. ex

def d(c):
   print('d', c)

def dec_2(a):
    print('dec_2', a)
    return d


def dec_1():
    print('dec_1')
    return dec_2


@dec_1()
def f(v):
    print('f')

print('s')
f(2)

27. Assert

assert Expression[, Arguments]

If the expression is false, Python raises an AssertionError exception, using the expression after the comma as the argument for the AssertionError.

assert False, "Error here"

python.exe - The ``-O`` switch removes assert statements, the ``-OO`` switch removes both assert statements and doc strings.

28. Debugging and Profiling

https://habr.com/en/company/mailru/blog/201594/ Profiling - collecting runtime characteristics of a program

  • Manual
    • the "stare at the code" method - effort and payoff are hard to estimate
    • manual measurement - to confirm or refute a hypothesis about a bottleneck
      • time - Unix tool
  • statistical profiler - at short intervals, samples a pointer to the currently executing function
    • gprof - Unix tool for C, Pascal, or Fortran77
    • there are not many of them
  • deterministic (event-based) profiler - traces every function call, return and exception and measures the intervals between these events; may slow the program down 2x or more
    • Python standard library provides:
      • profile - if cProfile is not available
      • cProfile
  • debugging

28.1. cProfile

primitive calls - calls not induced via recursion

ncalls
number of calls
tottime
time spent inside the function itself, excluding subfunctions
percall
tottime/ncalls
cumtime
time spent in this function and all subfunctions, including recursion
percall
cumtime divided by primitive calls
import cProfile
import re
cProfile.run('re.compile("foo|bar")', filename='restats')
#  pstats.Stats class reads profile results from a file and formats them in various ways.
# python -m cProfile [-o output_file] [-s sort_order] (-m module | myscript.py)

28.2. small code measure 1

python3 -m timeit '"-".join(str(n) for n in range(100))'

def test():
    """Stupid test function"""
    L = [i for i in range(100)]

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test()", setup="from __main__ import test"))

28.3. small code measure 2

import time
start_time = time.time()
main()
print("--- %s seconds ---" % (time.time() - start_time))

28.4. breakpoint and code investigation
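Since Python 3.7, breakpoint() drops into pdb by default; under the hood it calls sys.breakpointhook, which can be overridden (or disabled with PYTHONBREAKPOINT=0). A sketch that swaps the hook so the script stays non-interactive:

```python
import sys

hits = []
# Replace the default hook (pdb.set_trace) with a logger.
sys.breakpointhook = lambda *args, **kwargs: hits.append((args, kwargs))

def buggy(x):
    breakpoint()   # would normally open the pdb prompt here
    return x * 2

result = buggy(21)
print(result, len(hits))  # 42 1
```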

29. inject

29.1. Callable

import inject
# configuration
inject.configure(lambda binder: binder.bind_to_provider('predict', lambda: predict))
# or
def my_config(binder):
  binder.bind_to_provider('predict', lambda: predict)
inject.configure(my_config)

# usage
@inject.params(predict='predict')  # maps the param name to a binder key
def detect_advanced(self, predict=None) -> (int, any):
    ...

30. BUILD and PACKAGING

setup.py - based on distutils and setuptools - was the most widely used approach. Since PEP 517 and PEP 518, pyproject.toml is the recommended packaging format.

30.1. build tools:

frontend - reads pyproject.toml

backend - defined in [build-system]->build-backend; creates the build artifacts and dictates what additional information is required in the pyproject.toml file

  • Hatch or Hatchling
  • setuptools
  • Flit
  • PDM

30.1.1. hatchling

backend and frontend

hatch build /path/to/project
  1. links

30.1.2. setuptools

build backend

A collection of enhancements to the Python distutils that allows you to more easily build and distribute Python distributions, especially ones that have dependencies on other packages.

setup.py/setup.cfg defines the dependencies for a single project; requirements files are often used to define the requirements for a complete Python environment.

It is not considered best practice to use install_requires to pin dependencies to specific versions, or to specify sub-dependencies (i.e. dependencies of your dependencies).

  1. ex setup.cfg
    install_requires=[
       'A>=1,<2', # not allow v2
       'B>=2'
    ]
    
  2. old way

    install

    • python setup.py build
    • python setup.py install --install-lib ~/.local/lib/python3.10/site-packages/
  3. links

30.1.3. gpep517

a minimal tool to aid building wheels for Python packages

gpep517 build-wheel --backend setuptools.build_meta --output-fd 3 --wheel-dir /var/tmp/portage/dev-python/flask-2.3.2/work/Flask-2.3.2-python3_11/wheel
gpep517 install-wheel --destdir=/var/tmp/portage/dev-python/flask-2.3.2/work/Flask-2.3.2-python3_11/install --interpreter=/usr/bin/python3.11 --prefix=/usr --optimize=all /var/tmp/portage/dev-python/flask-2.3.2/work/Flask-2.3.2-python3_11/wheel/Flask-2.3.2-py3-none-any.whl

commands:

get-backend
to read build-backend from pyproject.toml (auxiliary command)
build-wheel
to call the respective PEP 517 backend in order to produce a wheel
install-wheel
to install a wheel into the specified directory
install-from-source
to combine building a wheel and installing it (without leaving the artifacts)
verify-pyc
to verify that the .pyc files in the specified install tree are correct and up to date
  1. links

30.2. toml format for pyproject.toml

Tom's Obvious Minimal Language

30.2.1. basic

  • \b - backspace (U+0008)
  • \t - tab (U+0009)
  • \n - linefeed (U+000A)
  • \f - form feed (U+000C)
  • \r - carriage return (U+000D)
  • \" - quote (U+0022)
  • \\ - backslash (U+005C)
  • \uXXXX - unicode (U+XXXX)
  • \UXXXXXXXX - unicode (U+XXXXXXXX)
# This is a TOML comment
str1 = "I'm a string."
str2 = "You can \"quote\" me."
str3 = "Name\tJos\u00E9\nLoc\tSF."

str1 = """
Roses are red
Violets are blue"""

str2 = """\
  The quick brown \
  fox jumps over \
  the lazy dog.\
  """

# Literal strings - No escaping is performed so what you see is what you get
path = 'C:\Users\nodejs\templates'
path2 = '\\User\admin$\system32'
quoted = 'Tom "Dubs" Preston-Werner'
regex = '<\i\c*\s*>'

# multi-line literal strings
re = '''I [dw]on't need \d{2} apples'''
lines = '''
The first newline is
trimmed in raw strings.
All other whitespace
is preserved.
'''

30.2.2. numbers

# integers
int1 = +99
int2 = 42
int3 = 0
int4 = -17

# hexadecimal with prefix `0x`
hex1 = 0xDEADBEEF
hex2 = 0xdeadbeef
hex3 = 0xdead_beef

# octal with prefix `0o`
oct1 = 0o01234567
oct2 = 0o755

# binary with prefix `0b`
bin1 = 0b11010110

# fractional
float1 = +1.0
float2 = 3.1415
float3 = -0.01

# exponent
float4 = 5e+22
float5 = 1e06
float6 = -2E-2

# both
float7 = 6.626e-34

# separators
float8 = 224_617.445_991_228

# infinity
infinite1 = inf # positive infinity
infinite2 = +inf # positive infinity
infinite3 = -inf # negative infinity

# not a number
not1 = nan
not2 = +nan
not3 = -nan

30.2.3. Dates and Times

# offset datetime
odt1 = 1979-05-27T07:32:00Z
odt2 = 1979-05-27T00:32:00-07:00
odt3 = 1979-05-27T00:32:00.999999-07:00

# local datetime
ldt1 = 1979-05-27T07:32:00
ldt2 = 1979-05-27T00:32:00.999999

# local date
ld1 = 1979-05-27

# local time
lt1 = 07:32:00
lt2 = 00:32:00.999999

30.2.4. array and table

  • Key/value pairs within tables are not guaranteed to be in any specific order.
  • Bare keys may only contain ASCII letters, ASCII digits, underscores, and dashes (A-Za-z0-9_-). Note that bare keys are allowed to be composed of only ASCII digits, e.g. 1234, but are always interpreted as strings.

  • Quoted keys
key = # INVALID
first = "Tom" last = "Preston-Werner" # INVALID
1234 = "value"
"127.0.0.1" = "value"

= "no key name"  # INVALID
"" = "blank"     # VALID but discouraged
'' = 'blank'     # VALID but discouraged

fruit.name = "banana"     # this is best practice
fruit. color = "yellow"    # same as fruit.color
fruit . flavor = "banana"   # same as fruit.flavor

# DO NOT DO THIS - Defining a key multiple times is invalid.
name = "Tom"
name = "Pradyun"
# THIS WILL NOT WORK
spelling = "favorite"
"spelling" = "favourite"

# This makes the key "fruit" into a table.
fruit.apple.smooth = true
# So then you can add to the table "fruit" like so:
fruit.orange = 2

# THE FOLLOWING IS INVALID
fruit.apple = 1
fruit.apple.smooth = true

integers = [ 1, 2, 3 ]
colors = [ "red", "yellow", "green" ]
nested_arrays_of_ints = [ [ 1, 2 ], [3, 4, 5] ]
nested_mixed_array = [ [ 1, 2 ], ["a", "b", "c"] ]
string_array = [ "all", 'strings', """are the same""", '''type''' ]

# Mixed-type arrays are allowed
numbers = [ 0.1, 0.2, 0.5, 1, 2, 5 ]
contributors = [
  "Foo Bar <foo@example.com>",
  { name = "Baz Qux", email = "bazqux@example.com", url = "https://example.com/bazqux" }
]
integers2 = [
  1, 2, 3
]

integers3 = [
  1,
  2, # this is ok
]

[table-1]
key1 = "some string"
key2 = 123

[table-2]
key1 = "another string"
key2 = 456

[a.b.c]            # this is best practice
[ d.e.f ]          # same as [d.e.f]
[ g .  h  . i ]    # same as [g.h.i]
[ j . "ʞ" . 'l' ]  # same as [j."ʞ".'l']

30.3. pyproject.toml

consists of the [build-system] table, the [project] metadata table, and tool-specific tables

folder structure: https://packaging.python.org/en/latest/tutorials/packaging-projects/

30.3.1. [build-system]

Hatch

requires = ["hatchling"]
build-backend = "hatchling.build"

setuptools

requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

Flit

requires = ["flit_core>=3.4"]
build-backend = "flit_core.buildapi"

PDM

requires = ["pdm-backend"]
build-backend = "pdm.backend"

30.3.2. metadata [project] and [project.urls]

PEP 621 defines the [project] table - https://packaging.python.org/en/latest/specifications/declaring-project-metadata/#declaring-project-metadata

[project]
name = "example_package_YOUR_USERNAME_HERE"
version = "0.0.1"
authors = [
  { name="Example Author", email="author@example.com" },
] # optional?
description = "A small example package"
readme = "README.md"
license = {file = "LICENSE.txt"} # optional
keywords = ["egg", "bacon", "sausage", "tomatoes", "Lobster Thermidor"] # optional
requires-python = ">=3.7"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
dependencies = [
  "httpx",
  "gidgethub[httpx]>4.0.0",
  "django>2.1; os_name != 'nt'",
  "django>2.0; os_name == 'nt'",
] # optional

[project.optional-dependencies]
gui = ["PyQt5"]
cli = [
  "rich",
  "click",
]


[project.urls]
"Homepage" = "https://github.com/pypa/sampleproject"
"Bug Tracker" = "https://github.com/pypa/sampleproject/issues"

[project.scripts]
spam-cli = "spam:main_cli"

30.3.3. [project.scripts]

mycmd = mymod:main

would create a command mycmd launching a script like this:

import sys
from mymod import main
sys.exit(main())

main() should return the process exit status (0 on success)

  1. links

30.3.4. dependencies

30.4. build

python3 -m build

create: dist/

  • ├── example_package_YOUR_USERNAME_HERE-0.0.1-py3-none-any.whl - built distribution with binaries
  • └── example_package_YOUR_USERNAME_HERE-0.0.1.tar.gz - source distribution

30.5. distutils (old)

The distutils package has been deprecated in 3.10 and was removed in Python 3.12. Its functionality for specifying package builds has already been completely replaced by the third-party packages setuptools and packaging, and most other commonly used APIs are available elsewhere in the standard library (such as platform, shutil, subprocess or sysconfig).

30.6. terms

  • Source Distribution (or “sdist”) - generated using python setup.py sdist.
  • Wheel - A Built Distribution format
  • build - is a PEP 517 compatible Python package builder.
    • pep517 - new style of source tree based around the pep518 pyproject.toml + [build-backend]
  • setup.py-style - de facto specification for "source tree"
  • src-layout - as opposed to flat layout; the recommended package folder structure. PEP 660

types of artifacts:

  • The source distribution (sdist): python3 -m build --sdist source-tree-directory
  • The built distributions (wheels): python3 -m build --wheel source-tree-directory
    • no compilation required during install:

30.7. recommended

dependency management:

  • pip with --require-hashes and --only-binary :all:
  • virtualenv or venv
  • pip-tools, Pipenv, or poetry
  • wheel project - offers the bdist_wheel setuptools extension
  • buildout: primarily focused on the web development community
  • Spack, Hashdist, or conda: primarily focused on the scientific community.

package tools

  • setuptools
  • build to create Source Distributions and wheels.
  • cibuildwheel - If you have binary extensions and want to distribute wheels for multiple platforms
  • twine - for uploading distributions to PyPI.

30.8. Upload to the package distribution service

30.8.1. TODO twine

twine upload dist/package-name-version.tar.gz dist/package-name-version-py3-none-any.whl

30.8.2. TODO Github actions

30.9. editable installs PEP660

pip install --editable

editable installation mode - installing a project in such a way that the Python code being imported remains in the source directory

Python programmers want to be able to develop packages without having to install (i.e. copy) them into site-packages, for example, by working in a checkout of the source repository.

In practice it just adds the source directories to the import path (PYTHONPATH-style).

There are now two types of wheels: normal and "editable".

30.10. PyPi project name, name normalization and other specifications

Names may contain ASCII letters, ASCII digits, and the characters ., -, and _; runs of those separators are normalized to a single -.

  • normalized to lowercase

Valid non-normalized names: ^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$ (matched case-insensitively)

Normalization: re.sub(r"[-_.]+", "-", name).lower()
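The normalization rule is small enough to run directly; runs of ., -, _ collapse to a single dash and the result is lowercased:

```python
import re

def normalize(name: str) -> str:
    # PyPI / PEP 503 project-name normalization
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize("Flask_SQLAlchemy"))   # flask-sqlalchemy
print(normalize("zope.interface"))     # zope-interface
```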

Source distribution format - pep-0517 PEP 518

  • Source distribution file name: {name}-{version}.tar.gz
  • contains a single top-level directory called {name}-{version} (e.g. foo-1.0), containing the source files of the package.
  • directory must also contain
    • a pyproject.toml
    • PKG-INFO file containing metadata - PEP 566

30.11. TODO src layout vs flat layout

30.12. links

31. setuptools - build system

32. pip (package manager)

Installed together with Python

  • (pip3 for Python 3) by default - MIT -
  • pip.pypa.io

Some package managers, including pip, use PyPI as the default source for packages and their dependencies.

Python Package Index - official third-party software repository for Python

  • PyPI (ˌpaɪpiˈaɪ)

32.1. release steps

  1. register at pypi.org
  2. https://pypi.org/manage/account/#api-tokens
  3. github->project->Secrets and variables->actions
    • New repository secret
    • PYPI_API_TOKEN
    • token from 2)
  4. github->project->Actions->add->Publish Python Package

32.2. wheels

“Wheel” is a built archive format (.whl) that can greatly speed up installation compared to building from source.

to disable wheels:

  • --no-cache-dir
  • --no-binary=:all:

32.3. virtualenv

It may happen that project A requires version 1.0.0 of a package while project B requires the newer version 2.0.0, for example.

  • pip cannot distinguish between versions in the «site-packages» directory

pip install virtualenv

32.4. venv

create:

python -m venv /path/to/new/virtual/environment
  • pyvenv.cfg - created
  • bin (or Scripts on Windows) containing a copy/symlink of the Python binary/binaries
  • a file named pyvenv.cfg is looked for in the interpreter's directory or one level above;
  • if the file is found, its home key is read; its value is the base directory;
  • the system library is then searched for in the base directory (using os.py as a marker);

To use:

  • source bin/activate
  • ./bin/python main.py

32.5. update

pip3 install --upgrade pip --user

  • outdated packages: pip3 list --outdated
  • upgrade: pip3 install --upgrade SomePackage

32.6. requirements.txt

To install:

  • pip install -r requirements.txt

To create:

  1. pip freeze > requirements.txt - based on all installed libraries
  2. pipreqs . - based on imports - requires pip3 install pipreqs --user

Watch out for cross-platform compatibility! Not all libraries are cross-platform!

docopt == 0.6.1             # Version Matching. Must be version 0.6.1
keyring >= 4.1.1            # Minimum version 4.1.1
coverage != 3.5             # Version Exclusion. Anything except version 3.5
Mopidy-Dirble ~= 1.1        # Compatible release. Same as >= 1.1, == 1.*

# without version:
nose
nose-cov
beautifulsoup4

32.7. errors

Traceback (most recent call last):
  File "/usr/bin/pip3", line 9, in <module>
    from pip import main
ImportError: cannot import name 'main'

SOLUTION: alias pip3="/home/u2/.local/bin/pip3"

32.8. cache dir

to reduce the amount of time spent on duplicate downloads and builds.

  • cached:
    • http responses
    • Locally built wheels
  • pip cache dir

32.9. hashes

  • pip install package --require-hashes
  • Requirements must be pinned with ==
  • weak hashes: md5, sha1, and sha224
  • python -m pip download --no-binary=:all: SomePackage
  • python -m pip hash --algorithm sha512 ./pip_downloads/SomePackage-2.2.tar.gz
  • pip install --force-reinstall --no-cache-dir --no-binary=:all: --require-hashes --user -r requirements.txt

FooProject == 1.2 --hash=sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824 \
             --hash=sha256:486ea46224d1bb4fb680f34f7c9ad96a8f24ec88be73ea8e5a6c65260e9cb8a7
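A --hash value is just the hex digest of the artifact's bytes; hashlib reproduces what pip hash computes (the content below is a stand-in, not a real package):

```python
import hashlib

artifact = b"stand-in for the bytes of SomePackage-2.2.tar.gz"
digest = hashlib.sha256(artifact).hexdigest()  # 64 hex characters
print("--hash=sha256:" + digest)
```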

32.10. add SSL certificate

export PIP_CERT=/etc/ssl/certs/rnb.pem

Dockerfile:

  • COPY /etc/ssl/certs/rnb.pem /rnb.pem
  • ENV PIP_CERT=/rnb.pem

32.10.1. crt(not working)

  • pip config set global.cert path/to/ca-bundle.crt
  • pip config list
  • conda config --set ssl_verify path/to/ca-bundle.crt
  • conda config --show ssl_verify
  • git config --global http.sslVerify true
  • git config --global http.sslCAInfo path/to/ca-bundle.crt

https://stackoverflow.com/questions/39356413/how-to-add-a-custom-ca-root-certificate-to-the-ca-store-used-by-pip-in-windows

32.10.2. pem(not working)

pip config set global.cert /home/RootCA3.pem - point to the self-signed certificate if Python module installation errors occur.

  • python -c "import ssl; print(ssl.get_default_verify_paths())"
  • add pem to path

32.11. ignore SSL certificates

pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <package_name>

33. urllib3 and requests library

requests->urllib3->http.client

request parameters:

  • data - body, sent with header Content-Type: application/x-www-form-urlencoded
  • params - ?param=value - urllib.parse.quote(string)
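In Python 3 the quoting helpers live in urllib.parse; this shows how params become a query string and how data is form-encoded into the body:

```python
from urllib.parse import urlencode, quote

query = urlencode({"param": "value", "q": "a b"})
print(query)           # param=value&q=a+b

body = urlencode({"user": "jo", "msg": "hi there"})  # what `data` turns into
print(body)            # user=jo&msg=hi+there

print(quote("a b/c"))  # a%20b/c - percent-encoding for path segments
```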

33.1. difference

speed - I found that sending data from the client to the server took the same time for both modules (urllib, requests), but returning data from the server to the client was more than twice as fast with urllib compared to requests.

33.2. see raw request

33.2.1. requests

  1. after request:

    p = requests.post(f'http://127.0.0.1:8081/transcribe/{rid}/find_sentence', params={'sentences': sentences})
    print("----request:")
    [print(x) for x in p.request.__dict__.items()]
    


  2. before request
    s = Session()
    req = Request('GET',  url, data=data, headers=headers)
    prepped = s.prepare_request(req)
    [print(x) for x in prepped.__dict__.items()]
    
  3. after request from logs:
    import requests
    import logging
    
    # These two lines enable debugging at httplib level (requests->urllib3->http.client)
    # You will see the REQUEST, including HEADERS and DATA, and RESPONSE with HEADERS but without DATA.
    # The only thing missing will be the response.body which is not logged.
    try:
        import http.client as http_client
    except ImportError:
        # Python 2
        import httplib as http_client
    http_client.HTTPConnection.debuglevel = 1
    
    # You must initialize logging, otherwise you'll not see debug output.
    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.DEBUG)
    requests_log.propagate = True
    
    requests.get('https://httpbin.org/headers')
    

33.3. problems:

34. pdf 2 png

34.1. pdf2image

requires poppler-utils

  • wraps pdftoppm and pdftocairo
  • to PIL image

34.2. Wand

pip3 install Wand

ImageMagick binding

34.3. PyMuPDF

pip3 install PyMuPDF

35. statsmodels

35.1. ACF, PACF

from pandas import read_csv
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from matplotlib import pyplot
series = read_csv('seasonally_adjusted.csv', header=None)
plot_acf(series, lags=150)  # lag values along the x-axis and correlation on the y-axis between -1 and 1
plot_pacf(series)  # same idea, but shorter-lag correlations no longer mask the longer ones
pyplot.show()

35.2. bar plot

import seaborn as sns
loan_type_count = data['Loan Type'].value_counts()
sns.set(style="darkgrid")
sns.barplot(x=loan_type_count.index, y=loan_type_count.values, alpha=0.9)

36. XGBoost

One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree).

36.1. usage

import xgboost as xgb

or

from xgboost import XGBClassifier - multi:softprob if classes > 2

for multiclass classification:

  • from sklearn.preprocessing import LabelBinarizer
  • y = np.array(['apple', 'pear', 'apple', 'orange'])
  • y_dense = LabelBinarizer().fit_transform(y) - [ [1 0 0],[0 0 1],[1 0 0],[0 0 1] ]

36.2. categorical columns

The policy of XGBoost is to not have special support for categorical variables. It is up to you to encode them before providing the features to the algorithm.

If booster=='gbtree' (the default), then XGBoost can handle categorical variables encoded as numeric directly, without needing dummifying/one-hotting. Whereas if the label is a string (not an integer) then yes, we need to convert it.

36.2.1. Feature importance between numerical and categorical features

https://discuss.xgboost.ai/t/feature-importance-between-numerical-and-categorical-features/245

one-hot encoding. Consequently, each categorical feature transforms into N sub-categorical features, where N is the number of possible outcomes for this categorical feature.

Then each sub-categorical feature would compete with the rest of sub-categorical features and all numerical features. It is much easier for a numerical feature to get higher importance ranking.

What we can do is to set importance_type to weight and then add up the frequencies of sub-categorical features to obtain the frequency of each categorical feature.

36.3. gpu support

tree_method = 'gpu_hist'
gpu_id = 0  (optional)

36.4. result value from leaf value

The final probability prediction is obtained by taking the sum of the leaf values (raw scores) across all trees and transforming it into (0, 1) with the sigmoid function 1 / (1 + math.exp(-x)).

leaf = 0.1111119  # raw score
result = 1/(1 + np.exp(-leaf))  # ≈ 0.5277 - probability score, logistic function

xgb.plot_tree(bst, num_trees=num_round-1) # default 0 tree

print(bst.predict(t, ntree_limit=1)) # first 0 tree, default - all

36.5. terms

  • instance or entity - line
  • feature - column
  • data - list of instances - 2D
  • labels - 1D list of labels for instances

36.6. xgb.DMatrix

  • LibSVM text format file
  • Comma-separated values (CSV) file
  • NumPy 2D array
  • SciPy 2D sparse array
  • cuDF DataFrame
  • Pandas data frame, and
  • XGBoost binary buffer file.
data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target array([1, 0, 1, 0, 0])
dtrain = xgb.DMatrix(data, label=label)

# weights
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)

36.6.1. LibSVM file format

1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
  • Each line represent a single instance
  • 1,0 - labels - probability values in [0,1]
  • 101, 102 - feature indices
  • 1.2, 0.03 - feature values
xgb.DMatrix('/home/u2/Downloads/agaricus.txt.train')
xgb.DMatrix('train.csv?format=csv&label_column=0')
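The LibSVM line format above is simple enough to parse by hand; a minimal sketch of a single-line parser:

```python
def parse_libsvm_line(line: str):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    label, *pairs = line.split()
    features = {int(i): float(v) for i, v in (p.split(":") for p in pairs)}
    return float(label), features

label, feats = parse_libsvm_line("1 101:1.2 102:0.03")
print(label, feats)   # 1.0 {101: 1.2, 102: 0.03}
```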

36.7. parameters

https://xgboost.readthedocs.io/en/latest/parameter.html

param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}

objective:

  • 'binary:logistic' - labels [0,1] - output probability, binary

-'reg:squarederror' - regression with squared loss

  • multi:softmax multiclass classification using the softmax objective

'booster': 'gbtree' - gbtree and dart use tree based models while gblinear uses linear functions

eval_metric - rmse for regression, and error for classification, mean average precision for ranking

  • error - Binary classification #(wrong cases)/#(all cases)

'seed': 0 - random seed

gbtree

  • 'eta': 0.3 - learning_rate
  • 'max_depth': 6 - Maximum depth of a tree - more = more complex and more likely to overfit
  • 'gamma': 0 - minimum loss reduction required to make a further partition on a leaf node of the tree; increase to make the model more conservative

36.8. print important features

import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('TkAgg')

xgb.plot_importance(bst)
plt.show()

36.9. TODO prune - tree pruning

36.10. permutation importance

for XGBClassificator (binary) - sklearn.inspection.permutation_importance

other - shap values

36.12. Errors

36.12.1. ValueError: setting an array element with a sequence.

36.12.2. label must be in [0,1] for logistic regression

37. Natasha & Yargy

  • pip install jupyter
  • pip install yargy ipymarkup - markup highlighting
  • jupyter.exe notebook
  • graphviz, with PATH manually pointed at its bin directory

37.1. yargy

Drawbacks:

  • slow
  • not flexible
  • rules with conditions cannot be built

37.1.1. yargy.tokenizer

from yargy.tokenizer import MorphTokenizer # used by default
t = MorphTokenizer()
list(t('asds'))[0].value
list(t('asds'))[0].normalized

Its rules:

  • TokenRule('RU', '[а-яё]+'),
  • TokenRule('LATIN', '[a-z]+'),
  • TokenRule('INT', '\d+'),
  • TokenRule('PUNCT','[-\\/!#$%&()\[\]\*\+,\.:;<=>?@^_`{|}~№…"\'«»„“ʼʻ”]'),
  • TokenRule('EOL', '[\n\r]+'),
  • TokenRule('OTHER', '§')]

remove some of the rules: tokenizer = Tokenizer().remove_types('EOL')

37.1.2. rules

  • yargy.predicates - type('INT'), eq('г'), _or(normalized('ложка'), caseless('вилка'))
  • yargy.rule - rule(predicates, …), or_
  • yargy.pipelines - gazetteer - a list-based rule constructor
    • morph_pipeline(['л','г']) - normalizes words before matching
    • caseless_pipeline(['Абд Аль','и']) - lowercases words before matching
  • yargy.interpretation.fact('name', ['attribute', …]) - used by predicates for their interpretation. Interpretation is folding the parse tree bottom-up.
    • attribute - default value for an attribute plus operations on the result
f = fact('name', [attribute('year', 2017)])
a = eq('100').interpretation(f.year.custom(any_one_argument_function))
r=rule(a).interpretation(f)
match.fact or match.tree.as_dot

37.1.4. predicates

  • eq(value) a == b
  • caseless(value) a.lower() == b.lower()
  • in_(value) a in b
  • in_caseless(value) a.lower() in b
  • gte(value) a >= b
  • lte(value) a <= b
  • length_eq(value) len(a) == b
  • normalized(value) normal form of the word == value
  • dictionary(value) normal form of the word in value
  • gram(value) value is among the word's grammemes
  • type(value) token type equals value
  • tag(value) token tag equals value
  • custom(function[, types]) use function as the predicate
  • true always returns True
  • is_lower str.islower
  • is_upper str.isupper
  • is_title str.istitle
  • is_capitalized word starts with a capital letter
  • is_single word is in the singular

Sets:

  • optional()
  • repeatable(min=None, max=None, reverse=False)
  • interpretation(a.a) - attaches the predicate to an interpretation element

37.1.5. non-standard word forms (e.g. "рулетики")

  • a T library?
  • reduce diminutives to the standard form - small dictionaries?

37.1.6. ex

#------- rule as a context-free grammar ----
from yargy import rule
R = rule('a','b')
R.normalized.as_bnf
>> R -> 'a' 'b'
#------- FLOAT -------
from yargy import rule, or_
from yargy.predicates import eq, type as _type, in_
INT = _type('INT')
FLOAT = rule(INT, in_(',.'), INT)
FRACTION = rule(INT, eq('/'), INT)
RANGE = rule(INT, eq('-'), INT)
AMOUNT = or_(
  rule(INT),
  FLOAT,
  FRACTION,
  RANGE)
#------- MorphTokenizer -----------
from yargy.tokenizer import MorphTokenizer
TOKE = MorphTokenizer()
l = list(TOKE(text))
for i in l: print('\n'.join(map(str, i)))
#--------- findall ----------
from yargy import rule, Parser
from yargy.predicates import eq

line = '100 г'

MEASURE = rule(eq(100))
parser = Parser(MEASURE.optional())
matches=list(parser.findall(line))
# --------- Simples ------
from yargy import rule, Parser
r = rule('a','b')
parser = Parser(r)
line = 'abc'
match = parser.match(line)
# ----------- spans  show --------
from ipymarkup import markup, AsciiMarkup

spans = [_.span for _ in matches]
for line in markup(text, spans, AsciiMarkup).as_ascii:
    print(line)

37.1.7. natasha

Extractors:

  • NamesExtractor - NAME,tagger=tagger
  • SimpleNamesExtractor - SIMPLE_NAME
  • PersonExtractor - PERSON, tagger=tagger
  • DatesExtractor - DATE
  • MoneyExtractor - MONEY
  • MoneyRateExtractor - MONEY_RATE
  • MoneyRangeExtractor - MONEY_RANGE
  • AddressExtractor - ADDRESS, tagger=tagger
  • LocationExtractor - LOCATION
  • OrganisationExtractor - ORGANISATION

37.1.9. QT console

  • https://qtconsole.readthedocs.io/en/stable/
  • https://www.tutorialspoint.com/jupyter/ipython_introduction.htm
  • inline figures
  • proper multi-line editing with syntax highlighting
  • graphical calltips
  • emacs-style bindings for text navigation
  • HTML or XHTML
  • PNG(outer or inline) in HTML, or inlined as SVG in XHTML
  • Run: jupyter qtconsole --style monokai
  • ! - system command (!dir)
  • ? - a? - information about a variable, plt?? - source definition, exit - q
  • In[2] - input string, Out[2] - output
  • display(object) - display anything supported
  • "*"*100500; - a trailing ; suppresses the output
  • Switch to SVG inline XHTML In [10]: %config InlineBackend.figure_format = 'svg'
  1. keys
    • Tab - autocompletion - press several times to cycle through completions
    • ``Enter``: insert new line (may cause execution, see above).
    • ``Ctrl-Enter``: force new line, never causes execution.
    • ``Shift-Enter``: force execution regardless of where cursor is, no newline added.
    • ``Up``: step backwards through the history.
    • ``Down``: step forwards through the history.
    • ``Shift-Up``: search backwards through the history (like ``Control-r`` in bash).
    • ``Shift-Down``: search forwards through the history.
    • ``Control-c``: copy highlighted text to clipboard (prompts are automatically stripped).
    • ``Control-Shift-c``: copy highlighted text to clipboard (prompts are not stripped).
    • ``Control-v``: paste text from clipboard.
    • ``Control-z``: undo (retrieves lost text if you move out of a cell with the arrows).
    • ``Control-Shift-z``: redo.
    • ``Control-o``: move to 'other' area, between pager and terminal.
    • ``Control-l``: clear terminal.
    • ``Control-a``: go to beginning of line.
    • ``Control-e``: go to end of line.
    • ``Control-u``: kill from cursor to the begining of the line.
    • ``Control-k``: kill from cursor to the end of the line.
    • ``Control-y``: yank (paste)
    • ``Control-p``: previous line (like up arrow)
    • ``Control-n``: next line (like down arrow)
    • ``Control-f``: forward (like right arrow)
    • ``Control-b``: back (like left arrow)
    • ``Control-d``: delete next character, or exits if input is empty
    • ``Alt-<``: move to the beginning of the input region.
    • ``alt->``: move to the end of the input region.
    • ``Alt-d``: delete next word.
    • ``Alt-Backspace``: delete previous word.
    • ``Control-.``: force a kernel restart (a confirmation dialog appears).
    • ``Control-+``: increase font size.
    • ``Control--``: decrease font size.
    • ``Control-Alt-Space``: toggle full screen. (Command-Control-Space on Mac OS X)
  2. magic
    • %lsmagic - Displays all magic functions currently available
    • %cd
    • %pwd
    • %dhist - directories you have visited in current session
    • %notebook - history to into an IPython notebook file with ipynb extension
    • %precision n - n digits after the decimal point
    • %recall n - re-execute the previous command or command n
    • %run a.py - run a file; -t measures execution time, -d runs under the debugger, -p under the profiler
      • %run -n main.py - import
    • %time command - displays the time required by the IPython environment to execute a Python expression
    • %who type - which variables have the given type
    • %whos - all imported and created objects
    • %hist - the whole history as text
    • %rep n - recall input n

    Python

    • %pdoc - documentation
    • %pdef - function definition
    • %psource - source code of a function or class
    • %pfile - full source of the corresponding file
  3. TEMPLATE
    #------ TEMPLATE ---------------
    # QTconsole ----
    In [1]: run -n main.py
    
    In [2]: main()
    
    In [3]: from yargy import rule, Parser
    from yargy.predicates import eq, type as _type, normalized
    MEASURE = rule(eq('НДС'))
    parser = Parser(MEASURE)
    for line in words:
        matches = list(parser.findall(line))
        spans = [_.span for _ in matches]
        mup(line, spans)
    # main.py ------
    #my
    import read_json
    
    
    # -- test
    words :list = [] #words from file
    index :int = 0
    # test --
    
    def mup(s :str, spans:list):
        """ выводит что поматчилось на строке """
        from ipymarkup import markup, AsciiMarkup
        for line in markup(s, spans, AsciiMarkup).as_ascii:
            print(line)
    
    def work(prov :dict):
        """вызывается для каждой строки """
        text = prov['naznach']
        # -- test
        global words, index
        words.append(text)
        index +=1
        if index >5: quit()
        # test --
    
    
    def main():#args):
        read_json.readit('a.txt', work) #aml_provodki.txt
    #################### MAIN ##########################
    if __name__ == '__main__':  #name of module-namespace.  '__main__' for - $python a.py
         #import sys
         main()#sys.argv)
         quit()
    
  4. Other
    #--------- yargy to graphviz ------------
    from ipymarkup import markup, show_markup
    spans = [_.span for _ in matches]
    show_markup(line,spans)
    
    r = rule(...
    r.normalized.as_bnf
    
    
    match.tree.as_dot
    # ----------- random sample of lines for testing ----
    from random import seed, sample
    seed(1)
    sample(lines, 20)
    
    
    OR
    from random import sample
    
    for a in sample(range(0,20), 2):
        print(a)
    #-------- matplotlib --------
    from matplotlib import pyplot as plt
    plt.plot(range(10),range(10))
    

37.1.10. graphviz

https://stackoverflow.com/questions/41942109/plotting-the-digraph-with-graphviz-in-python-from-dot-file

https://www.youtube.com/watch?time_continue=1027&v=NQxzx0qYgK8

m.tree.as_dot._repr_svg_() - produces something for graphviz

37.1.11. IPython

38. Stanford NER - Java

38.1. train

You give the data file, the meaning of the columns, and what features to generate via a properties file.

38.2. Training data

  • Dataturks NER tagger

39. DeepPavlov

Valentin Malykh, Alexey Lymar, MIPT

  • agents hold a dialogue with the user,
  • they have skills that get selected - a skill is a set of components: spellchecker, morphoanalyzer, intent classifier
  • skill - their input and output should both be strings
  • components can be chained together, similar to a spaCy pipeline

Components can be nested:

  • no syntactic parser
  • Question Answering system
  • NER and Slot filling
  • Classification
  • Goal-oriented bot
  • Spellchecker
  • Morphotagger

39.1. Command line

python .\deeppavlov\deep.py interact ner_rus [-d]

  • interaction, testing
  • ner_rus - C:\Users\Chepilev_VS\AppData\Local\Programs\Python\Python36\lib\site-packages\deeppavlov\configs\ner\ner_rus.json

39.2. helper classes

  • simple_vocab
    • self._t2i[token] = self.count - token-to-index mapping
    • self._i2t.append(token) - index-to-token mapping
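The two mappings above can be sketched as a tiny vocabulary class (hypothetical code illustrating the idea, not DeepPavlov's actual implementation):

```python
# Minimal sketch of simple_vocab's two mappings:
# _t2i maps tokens to indices, _i2t maps indices back to tokens.
class MiniVocab:
    def __init__(self):
        self._t2i = {}   # token -> index
        self._i2t = []   # index -> token
        self.count = 0

    def fit(self, tokens):
        for token in tokens:
            if token not in self._t2i:
                self._t2i[token] = self.count  # index of the token
                self._i2t.append(token)        # token at that index
                self.count += 1

    def __call__(self, tokens):
        return [self._t2i[t] for t in tokens]

vocab = MiniVocab()
vocab.fit(["the", "cat", "sat", "the"])
print(vocab(["cat", "the"]))  # [1, 0]
```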

39.3. in code

#------------ build model and interact ---------
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

faq = build_model(configs.faq.tfidf_logreg_en_faq, download = True)
a = faq(["I need help"])

39.4. installation

  • apt install libssl-dev libncurses5-dev libsqlite3-dev libreadline-dev libtk8.5 libgdm-dev libdb4o-cil-dev libpcap-dev

wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8rc1.tgz

  • tar -xvzf
  • cd Python-3.6.8
  • ./configure --enable-optimizations --with-ensurepip=install
  • make -j8
  • sudo make altinstall
  • python3.6
  • update-alternatives --install /usr/bin/python python /usr/bin/python3.6 1
  • update-alternatives --config python
  • python -m pip install --upgrade pip
  • git config --global http.proxy http://srv-proxy:8080
  • git clone https://github.com/deepmipt/DeepPavlov.git

variant 1

  • pip3.6 install virtualenv --user
  • ~/.local/bin/virtualenv ENV
  • source ENV/bin/activate

variant 2

  • python -m venv .
  • source bin/activate
  • pip install deeppavlov
  • ENV/bin/python

fastText

pip install git+https://github.com/facebookresearch/fastText.git#egg=fastText==0.8.22

install everything required by a specific DeepPavlov config by running:

python -m deeppavlov install <config_name>

MY FIXES: https://github.com/vitalij23/DeepPavlov/commits/master

  • JSON with comments:
    • pip3.6 install jstyleson
    • deeppavlov\core\common\file.py json -> jstyleson

39.5. training

we use BIO or IOB (inside-outside-beginning) tagging - it subdivides the "in" tags into begin-of-entity (B_X) and continuation-of-entity (I_X).
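A small illustration of the scheme (pure Python; tags here use the common `B-`/`I-` spelling, the underscore variant `B_X`/`I_X` from the notes is equivalent):

```python
# BIO/IOB: B-X opens an entity of type X, I-X continues it, O is outside.
tokens = ["Barack", "Obama", "visited", "New", "York"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]

def bio_to_entities(tokens, tags):
    """Collect (entity_text, type) pairs from a BIO-tagged sequence."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close a previous entity
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(bio_to_entities(tokens, tags))  # [('Barack Obama', 'PER'), ('New York', 'LOC')]
```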

dataset

  • train - data for training the model;
  • validation - data for evaluation and hyperparameter tuning;
  • test - data for final evaluation of the model.

Training consists of 3 elements: dataset_reader, dataset_iterator and train. Or at least two: dataset and train.

dataset_reader - the source of x and y

Proto-classes for dataset_iterator:

  • Estimator - no early stopping, safely done at the time of pipeline initialization; works in both supervised and unsupervised settings
    • fit()
  • NNModel - supervised learning;
    • in
    • in_y

Training:

  • rm -r ~/.deeppavlov/models/ner_rus
  • cd deep
  • source ENV/bin/activate
  • python3.6 -m deeppavlov train ~/ner_rus.json

39.6. NLP pipeline json config

https://deeppavlov.readthedocs.io/en/0.1.6/intro/config_description.html core/common/registry.json is used

  • If a component is given an id with a name, it can be referenced by that name instead of being created again: "ref": "id_name"

Four main sections:

  • dataset_reader
  • dataset_iterator
  • chainer - one required element
    • in
    • pipe
      • in
      • out
    • out
  • train

"metadata": {"variables" - defines the paths "DOWNLOADS_PATH", "MODELS_PATH", etc.

39.6.1. configs

ner_conll2003.json glove
ner_conll2003_pos.json glove
ner_dstc2.json random_emb_mat
ner_few_shot_ru.json elmo_embedder
ner_few_shot_ru_simulate.json elmo_embedder
ner_ontonotes.json glove
ner_rus.json fasttext
slotfill_dstc2.json nothing
slotfill_dstc2_raw.json nothing

39.6.2. parsing analysis

from deeppavlov import configs
from deeppavlov.core.commands.utils import parse_config
config_dict = parse_config(configs.ner.ner_ontonotes)
print(config_dict['dataset_reader']['data_path'])

39.6.3. json

{
  "deeppavlov_root": ".",
  "dataset_reader": { //deeppavlov\dataset_readers
    "class_name": "conll2003_reader",  //conll2003_reader.py
    "data_path": "{DOWNLOADS_PATH}/total_rus/", //folder to take train.txt, valid.txt, test.txt from
    "dataset_name": "collection_rus", //if the folder is empty, the link inside conll2003_reader.py is used
    "provide_pos": false //pos tag?
  },
  "dataset_iterator": { //deeppavlov\dataset_iterators
    //For simple batching and shuffling
    "class_name": "data_learning_iterator", //deeppavlov\core\data\data_learning_iterator.py
    "shuffle": true, //shuffles List[Tuple[Any, Any]] by default
    "seed": 42 //seed for random shuffle
  },
  "chainer": {  //list of components - core\common\chainer.py
    "in": ["x"], //names of inputs for pipeline inference mode
    "in_y": ["y"], //names of additional inputs for pipeline training and evaluation modes
    "out": ["x_tokens", "tags"], //names of pipeline inference outputs
    "pipe": [  //
    {
      "class_name": "tokenizer",
      "in": "x", //in of chainer
      "lemmas": true, // lemmatizer enabled
      "out": "q_token_lemmas"
    },

39.6.4. examples

  1. tokenizer

    x::As a'd.234 4567 >> ['as', "a'd.234", '4567']

    {
      "chainer": {
        "in": [ "x" ],
        "in_y": [ "y" ],
        "pipe": [
          {
            "class_name": "str_lower",
            "id": "lower",
            "in": [ "x" ],
            "out": [ "x_lower" ]
          },
          {
            "in": [ "x_lower" ],
            "class_name": "lazy_tokenizer",
            "out": [ "x_tokens" ]
          },
          {
            "in": [ "x_tokens" ],
            "class_name": "sanitizer",
            "nums": false,
            "out": [ "x_san" ]
          }
        ],
        "out": [ "x_san" ]
      }
    }
    

39.7. preprocessors

  • sanitizer - \models\preprocessors Remove all combining characters like diacritical marks from tokens deeppavlov\models\preprocessors\sanitizer.py
    • nums - replace [0-9] with 1, that's all
  • str_lower - batch.lower()
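A rough stdlib-only sketch of what these preprocessing steps do (the function name and exact behavior here are assumptions based on the notes, not DeepPavlov's actual code):

```python
import re
import unicodedata

def sanitize(tokens, nums=False):
    """Strip combining characters (diacritics) from each token and,
    if nums=True, replace every digit with 1."""
    out = []
    for tok in tokens:
        # NFD decomposition separates base characters from combining marks
        decomposed = unicodedata.normalize("NFD", tok)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        if nums:
            stripped = re.sub(r"[0-9]", "1", stripped)
        out.append(stripped)
    return out

print(sanitize(["café", "naïve", "x42"], nums=True))  # ['cafe', 'naive', 'x11']
```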

39.7.1. tokenizers

deeppavlov\models\tokenizers

  • lazy_tokenizer - english nltk word_tokenize (no parameters)
  • ru_tokenizer - lowercase - swallows the period together with the word
    • stopwords - List[str]
    • ngram_range - List[int] - size of ngrams to create; only unigrams are returned by default
    • lemmas - default=False - whether to perform lemmatizing or not
  • nltk_moses_tokenizer - MosesTokenizer().tokenize - like lazy_tokenizer; if the input is already tokens, it glues them together.
    • escape = False - if True, replaces | [ ] < > & with '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', …

39.7.2. Embedder [ɪmˈbede] - Deep contextualized word representation

  • "Words that occur in similar contexts tend to have similar meaning"
  • Consists of embedding matrices.
  • Converts every token to a vector of particular dimensionality
  • Vocabularies allow conversion from tokens to indices; this is needed to perform lookup in embedding matrices and to compute cross-entropy between predicted probabilities and target values.
  • For: (e.g. cosine) similarity - as a measure of semantic similarity
  • unsupervised learning algorithm

Classes

  • glove_emb - GloVe (Stanford) - by factorizing the logarithm of the corpus word co-occurrence matrix https://github.com/maciejkula/glove-python
  • ELMo - Embeddings from Language Models
    • whole sentences as context
  • fastText - By default, we use 100 dimensions
  • skip-gram - learns to predict using a random close-by word - skip-gram models work better with subword information than cbow.
    • designed to predict the context
    • works well with a small amount of training data, represents even rare words or phrases well.
    • slow
  • cbow - according to its context - uses the sum of their vectors to predict the target
    • learns to predict the word from the context, i.e. maximizes the probability of the target word given the context
    • there is a problem with rare words.
    • several times faster to train than skip-gram, slightly better accuracy for frequent words
  1. GloVe (Stanford)

    Global Vectors for Word Representation

    Goal: create a glove model X pip3 install https://github.com/JonathanRaiman/glove/archive/master.zip

    glovepy

    • corpus.py - Cooccurrence matrix construction tools for fitting the GloVe model.
    • glovepy.py - Glove(object) - Glove model for obtaining dense embeddings from a co-occurence (sparse) matrix.
  2. fastText skip-gram model

    Without subwords: ./fasttext skipgram -input data/fil9 -output result/fil9-none -maxn 0 -ws 30 -dim 300

    "class_name": "fasttext", deeppavlov\models\embedders\fasttext_embedder.py

39.8. components

  • simple_vocab - For holding sets of tokens, tags, or characters - \core\data\simple_vocab.py
    • id - the name of the vocabulary which will be used in other models
    • fit_on - the out of the previous component
    • save_path - path to a new file to save the vocabulary
    • load_path - path to an existing vocabulary (ignored if there is no files)
    • pad_with_zeros: whether to pad the resulting index array with zeros or not
    • out - indices

39.9. Models

  • Rule-based Models cannot be trained.
  • Machine Learning Models can only be trained standalone.
  • Deep Learning Models can be trained independently and in an end-to-end mode being joined in a chain.

Each model has its own architecture - e.g. CNN or LSTM+CRF

39.10. spellchecking

based on context with the help of a kenlm language model

two pipelines

https://github.com/deepmipt/DeepPavlov/blob/0.1.6/deeppavlov/configs/spelling_correction/levenshtein_corrector_ru.json

  • Damerau-Levenshtein distance to find correction candidates
  • No trainer
    • input: x, tokenized and lowercased
    • Files:
      1. russian_words_vocab.dict - "word 1" - without ё
      2. ru_wiyalen_no_punkt.arpa.binary - kenlm language model?
    • simple_vocab — word\tfrequency - file 1)
    • main component: deeppavlov.models.spelling_correction.levenshtein.searcher_component:LevenshteinSearcherComponent
      • x_tokens -> tokens_candidates
      • words - vocabulary - file 1)
      • max_distance = 1
      • initializes a LevenshteinSearcher over the vocabulary - returns nearby words and the distance to them
      • (0, word) - for punctuation
      • error_probability = 1e-4 = 0.0001
      • outputs e.g. мама: [(-4,'мара'),(-8,'мама')]
    • deeppavlov.models.spelling_correction.electors.kenlm_elector:KenlmElector spelling_correction\electors\kenlm_elector.py
      • 2)
      • picks the best candidate using file 2), even one with a lower score from Levenshtein
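The candidate search above relies on Damerau-Levenshtein distance; a minimal sketch of that distance (optimal string alignment variant, pure Python - not DeepPavlov's implementation, which uses a trie for efficiency):

```python
def damerau_levenshtein(a, b):
    """Edit distance where insertions, deletions, substitutions
    and adjacent transpositions all cost 1 (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# candidate search over a toy vocabulary with max_distance=1
vocab = ["мама", "мара", "рама", "молоко"]
word = "мама"
print([(damerau_levenshtein(word, w), w) for w in vocab
       if damerau_levenshtein(word, w) <= 1])
# [(0, 'мама'), (1, 'мара'), (1, 'рама')]
```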

https://github.com/deepmipt/DeepPavlov/blob/0.1.6/deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json

  • statistical error model
  • "dataset_iterator": deeppavlov\dataset_iterators\typos_iterator.py, a subclass of DataLearningIterator
  • "dataset_reader" :
  • Has a trainer
    • input: x, y - tokenized and lowercased
    • Files:
      1. error_model.tar.gz/error_model_ru.tsv
      2. {DOWNLOADS_PATH}/vocabs
      3. ru_wiyalen_no_punkt.arpa.binary - kenlm language model?
    • main component: spelling_error_model, a subclass of Estimator 1) - deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel
      • "fit_on" - x, y
      • in - x
      • out - tokens_candidates
      • error_model_ru.tsv "лицо ло 0.060606060606060615"
      • dictionary: class russian_words_vocab DeepPavlov\deeppavlov\vocabs\typos.py - Trie tree
        • 2)
    • deeppavlov.models.spelling_correction.electors.kenlm_elector:KenlmElector
      • 3)

spelling_error_model goes first

39.10.1. Trie vocabulary

Prefix tree - words branch letter by letter in the tree. https://ru.wikipedia.org/wiki/%D0%9F%D1%80%D0%B5%D1%84%D0%B8%D0%BA%D1%81%D0%BD%D0%BE%D0%B5_%D0%B4%D0%B5%D1%80%D0%B5%D0%B2%D0%BE
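A minimal prefix-tree sketch (generic illustration, not DeepPavlov's Trie class):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> child node
        self.is_word = False

class Trie:
    """Prefix tree: words share nodes along common prefixes."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

trie = Trie()
for w in ["кот", "кора", "код"]:  # share the "ко" prefix nodes
    trie.add(w)
print("кот" in trie, "ко" in trie)  # True False ("ко" is a prefix, not a word)
```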

39.11. Classification

  1. keras_classification_model - neural network on Keras with tensorflow - deeppavlov.models.classifiers.KerasClassificationModel
    • cnn_model – Shallow-and-wide CNN with max pooling after convolution,
    • dcnn_model – Deep CNN with number of layers determined by the given number of kernel sizes and filters,
    • cnn_model_max_and_aver_pool – Shallow-and-wide CNN with max and average pooling concatenation after convolution,
    • bilstm_model – Bidirectional LSTM,
    • bilstm_bilstm_model – 2-layers bidirectional LSTM,
    • bilstm_cnn_model – Bidirectional LSTM followed by shallow-and-wide CNN,
    • cnn_bilstm_model – Shallow-and-wide CNN followed by bidirectional LSTM,
    • bilstm_self_add_attention_model – Bidirectional LSTM followed by self additive attention layer,
    • bilstm_self_mult_attention_model – Bidirectional LSTM followed by self multiplicative attention layer,
    • bigru_model – Bidirectional GRU model.

Note that each model has its own parameters that must be specified in the config.

  1. sklearn_component - sklearn classifiers - deeppavlov.models.sklearn.SklearnComponent

configs/classifiers:

JSON Frame Embedder Dataset Lang model comment
insults_kaggle.json keras fasttext basic      
insults_kaggle_bert.json bert_classifier ? basic     new 0.2.0
intents_dstc2.json keras fasttext dstc2      
intents_dstc2_bert.json            
intents_dstc2_big.json keras fasttext dstc2      
intents_sample_csv.json            
intents_sample_json.json            
intents_snips.json keras fasttext SNIPS   cnn_model  
intents_snips_big.json            
intents_snips_sklearn.json            
intents_snips_tfidf_weighted.json            
paraphraser_bert.json            
rusentiment_bert.json     basic ru    
rusentiment_cnn.json keras fasttext basic ru cnn_model  
rusentiment_elmo.json keras elmo basic ru    
sentiment_twitter.json keras fasttext basic ru    
sentiment_twitter_preproc.json keras fasttext basic ru    
topic_ag_news.json            
yahoo_convers_vs_info.json keras elmo   en   no reader and iterator

one_hotter - in(y)out(y) - converts a given batch of lists of labels to one-hot representation
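What one-hot conversion means, as a tiny stdlib-only sketch (not the one_hotter component itself):

```python
def one_hot(labels, depth):
    """Map a batch of integer label ids to one-hot vectors of length `depth`."""
    return [[1 if i == label else 0 for i in range(depth)] for label in labels]

print(one_hot([0, 2], depth=3))  # [[1, 0, 0], [0, 0, 1]]
```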

39.11.1. bert

Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

Pre-trained representations:

  • context-free - word2vec or GloVe
  • contextual - based on the other words in the sentence
    • unidirectional
    • bidirectional

json:

  • bert_preprocessor in(x)
  • one_hotter in(y)
  • bert_classifier x y
  • proba2labels - probas to id
  • classes_vocab - id to labels

39.11.2. iterators

39.12. NER - components

conll2003_reader dataset_reader - BIO

  • "data_path": - three files, namely: “train.txt”, “valid.txt”, and “test.txt”

Models:

  • "ner": "deeppavlov.models.ner.network:NerNetwork",
  • "ner_bio_converter": "deeppavlov.models.ner.bio:BIOMarkupRestorer",
  • "ner_few_shot_iterator": "deeppavlov.dataset_iterators.ner_few_shot_iterator:NERFewShotIterator",
  • "ner_svm": "deeppavlov.models.ner.svm:SVMTagger",

preprocess

  • (unclear) random_emb_mat deeppavlov.models.preprocessors.random_embeddings_matrix:RandomEmbeddingsMatrix
  • "mask": "deeppavlov.models.preprocessors.mask:Mask"

deeppavlov.models.ner.network - whether the answer comes after all tokens or for each token

  • use_cudnn_rnn - true TF layouts build on - NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
  • net_type - rnn
  • cell_type - lstm

"in": ["x_emb", "mask", "x_char_ind", "cap"],

  • x_emb - fastText token embeddings

39.13. Custom component

  • \deeppavlov\core\common\registry.json

40. AllenNLP

41. spaCy

42. fastText

By default, we use 100 dimensions

  • skip-gram - learns to predict using a random close-by word - skip-gram models work better with subword information than cbow.
    • designed to predict the context
    • works well with a small amount of training data, represents even rare words or phrases well.
    • slow
    • better for rare words
  • cbow - according to its context - uses the sum of their vectors to predict the target
    • learns to predict the word from the context, i.e. maximizes the probability of the target word given the context
    • there is a problem with rare words.
    • several times faster to train than skip-gram, slightly better accuracy for frequent words

./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300

  • dim dimensions - default 100
  • -minn 2 -maxn 5 - substrings contained in a word between the minimum size (minn) and the maximal size (maxn)
  • -ws size of the context window [5]

-epoch number of epochs [5]

result

  • bin stores the whole fastText model and can be subsequently loaded
  • vec contains the word vectors, one per line for each word in the vocabulary. The first line is a header containing the number of words and the dimensionality of the vectors.

Checking:

  • ./fasttext nn result/fil9.bin
  • ./fasttext analogies result/fil9.bin

42.1. install

43. TODO rusvectores

44. Natural Language Toolkit (NLTK)

  • http://www.nltk.org/
  • API http://www.nltk.org/genindex.html
  • nltk.download('averaged_perceptron_tagger_ru') - russian. The NLTK corpus and module downloader.
    • Corpus - a collection of words http://www.nltk.org/howto/corpus.html
      • nltk.corpus.abc.words() - roughly which words are in there C:\Users\Chepilev_VS\AppData\Roaming\nltk_data
      • for w in nltk.corpus.genesis.words('english-web.txt'): print(w) - all the words
      • Plaintext Corpora
      • Tagged Corpora - ex. part-of-speech tags - (word,tag) tuples
    • Tagger
    • >>> nltk.download('book') - >>> from nltk.book import * - >>> text1
  • Accessing corpora: corpus - standardized interfaces to corpora and lexicons
  • String processing: tokenize, stem - tokenizers, sentence tokenizers, stemmers
  • Collocation discovery: collocations - t-test, chi-squared, point-wise mutual information
  • Part-of-speech tagging: tag - n-gram, backoff, Brill, HMM, TnT
  • Machine learning: classify, cluster, tbl - decision tree, maximum entropy, naive Bayes, EM, k-means
  • Chunking: chunk - regular expression, n-gram, named-entity
  • Parsing: parse, ccg - chart, feature-based, unification, probabilistic, dependency

44.1. collocations

nltk.collocations.BigramCollocationFinder

  • from_words([sequence of words], bigram_fdm, window_size=2) => finder - '.', ',', ':' act as separators

AbstractCollocationFinder

  • nbest(funct, n)=>[] top n ngrams when scored by the given function
  • finder.apply_freq_filter(min_freq) - the minimum number of occurrencies of bigrams to take into consideration
  • finder.apply_word_filter(lambda w: w == '.' or w == ',') - Removes candidate ngrams (w1, w2, …) where any of (fn(w1), fn(w2), …) evaluates to True.

44.2. Association measures for collocations (measure functions)

  • bigram_measures.student_t - Student's t
  • bigram_measures.chi_sq - Chi-square
  • bigram_measures.likelihood_ratio - Likelihood ratios
  • bigram_measures.pmi - Pointwise Mutual Information
  • raw_freq - Scores ngrams by their frequency

Contingency counts: n_ii = count of the bigram (w1, w2), n_ix = count of w1, n_xi = count of w2, n_xx = total number of bigrams.

#(n_ii, (n_ix, n_xi), n_xx):
>>> import nltk
>>> from nltk.collocations import *
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>>print('%0.4f' % bigram_measures.student_t(1, (2, 2), 4))
0
>>> print('%0.4f' % bigram_measures.student_t(1, (2, 2), 8))
0.5000
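The PMI measure can be checked by hand from the same contingency counts (stdlib only; assumes the standard formula log2(n_ii·n_xx / (n_ix·n_xi))):

```python
import math

def pmi(n_ii, n_ix_xi, n_xx):
    """Pointwise Mutual Information from bigram contingency counts:
    n_ii = count of the bigram (w1, w2), n_ix = count of w1,
    n_xi = count of w2, n_xx = total number of bigrams."""
    n_ix, n_xi = n_ix_xi
    return math.log2(n_ii * n_xx / (n_ix * n_xi))

print('%0.4f' % pmi(1, (2, 2), 4))  # 0.0000 - bigram occurs exactly as often as chance
print('%0.4f' % pmi(2, (2, 2), 8))  # 2.0000 - bigram occurs 4x more often than chance
```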

44.4. Russian language corpus

45. pymorphy2

https://pymorphy2.readthedocs.io/en/latest/user/grammemes.html

  • grammeme - one of the elements of a grammatical category - grammemes: tag=OpencorporaTag('NOUN,inan,masc plur,nomn')
  • uses the http://opencorpora.org/ dictionary
  • hypotheses are built for unknown words
  • the letter ё is fully supported
  • License - MIT

46. linux NLP

46.1. count max words in line of file

MAX=0; file="/path";
while read -r line; do if [[ $(echo $line | wc -w ) -gt $MAX ]]; then MAX=$(echo $line | wc -w ); fi; done < "$file"; echo $MAX

47. fuzzysearch

pip install --force-reinstall --no-cache-dir --no-binary=:all: --require-hashes --user -r file.txt

fuzzysearch==0.7.3 --hash=sha256:d5a1b114ceee50a5e181b2fe1ac1b4371ac8db92142770a48fed49ecbc37ca4c
attrs==22.2.0 --hash=sha256:c9227bfc2f01993c03f68db37d1d15c9690188323c067c641f1a35ca58185f99

47.1. typesense

47.1.1. pip3 install typesense --user

/usr/lib/python3/dist-packages/secretstorage/dhcrypto.py:15: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
/usr/lib/python3/dist-packages/secretstorage/util.py:19: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
Collecting typesense
  Downloading typesense-0.15.0-py2.py3-none-any.whl (30 kB)
Requirement already satisfied: requests in ./.local/lib/python3.8/site-packages (from typesense) (2.28.1)
Requirement already satisfied: idna<4,>=2.5 in ./.local/lib/python3.8/site-packages (from requests->typesense) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in ./.local/lib/python3.8/site-packages (from requests->typesense) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./.local/lib/python3.8/site-packages (from requests->typesense) (1.26.13)
Requirement already satisfied: charset-normalizer<3,>=2 in ./.local/lib/python3.8/site-packages (from requests->typesense) (2.1.1)
Installing collected packages: typesense
Successfully installed typesense-0.15.0

48. Audio - librosa

librosa uses soundfile and audioread for reading audio.

48.1. generic audio characteristics

  • Channels: number of channels; 1 for mono, 2 for stereo audio
  • Sample width: number of bytes per sample; 1 means 8-bit, 2 means 16-bit
  • Frame rate/Sample rate: frequency of samples used (in Hertz)
  • Frame width or Bit depth: Number of bytes for each “frame”. One frame contains a sample for each channel.
  • Length: audio file length (in milliseconds)
  • Frame count: the number of frames from the sample
  • Intensity: loudness in dBFS (dB relative to the maximum possible loudness)

48.2. load

default: librosa.core.load(path, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='kaiser_best')

  • sr is the sampling rate (To preserve the native sampling rate of the file, use sr=None.)
  • mono is the option (true/ false) to convert it into mono file.
  • offset is a floating point number which is the starting time to read the file
  • duration is a floating point number which signifies how much of the file to load.
  • dtype is the numeric representation of data can be float32, float16, int8 and others.
  • res_type is the type of resampling (one option is kaiser_best)
import librosa
y: np.array
y, sample_rate = librosa.load(filename, sr=None) # sampling rate as `sr` , y - time series
print("sample rate of original file:", sample_rate)
# -- Duration
print(librosa.get_duration(y=y, sr=sample_rate))
print("duration in seconds", len(y)/sample_rate)


from IPython.display import Audio
Audio(data=data1,rate=sample_rate) # play audio

# --- for WAV files:
import soundfile as sf
ob = sf.SoundFile('example.wav')
print('Sample rate: {}'.format(ob.samplerate))
print('Channels: {}'.format(ob.channels))
print('Subtype: {}'.format(ob.subtype))

# --- mp3
import audioread
with audioread.audio_open(filename) as f:
    print(f.channels, f.samplerate, f.duration)

48.3. the Fourier transform - spectrum

import numpy as np
import librosa
import matplotlib.pyplot as plt

# filepath = '/home/u2/h4/PycharmProjects/whisper/1670162239-2022-12-04-16_57.mp3'
filepath = '/mnt/hit4/hit4user/gitlabprojects/captcha_fssp/app/929014e341a0457f5a90a909b0a51c40.wav'

y, sr = librosa.load(filepath)
n_fft = 2048
# spectrum of the first FFT window only
ft = np.abs(librosa.stft(y[:n_fft], hop_length=n_fft + 1))

plt.plot(ft)
plt.title('Spectrum')
plt.xlabel('Frequency Bin')
plt.ylabel('Amplitude')
plt.show()

48.4. spectrogram

import numpy as np
import librosa
import matplotlib.pyplot as plt

# filepath = '/home/u2/h4/PycharmProjects/whisper/1670162239-2022-12-04-16_57.mp3'
filepath = '/mnt/hit4/hit4user/gitlabprojects/captcha_fssp/app/929014e341a0457f5a90a909b0a51c40.wav'

y, sr = librosa.load(filepath)

spec = np.abs(librosa.stft(y, hop_length=512))
spec = librosa.amplitude_to_db(spec, ref=np.max)
# fig, ax = plt.figure()
plt.imshow(spec, origin="lower", cmap=plt.get_cmap("magma"))

plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.show()

48.5. log-Mel spectrogram

import numpy as np
import librosa
import matplotlib.pyplot as plt

# filepath = '/home/u2/h4/PycharmProjects/whisper/1670162239-2022-12-04-16_57.mp3'
filepath = '/mnt/hit4/hit4user/gitlabprojects/captcha_fssp/app/929014e341a0457f5a90a909b0a51c40.wav'

y, sr = librosa.load(filepath)

hop_length = 512
n_mels = 128 #  linear transformation matrix to project FFT bins
n_fft = 2048 #  samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz
# one line mel spectrogram
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
# 3 lines mel spectrogram
fft_windows = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
magnitude = np.abs(fft_windows)**2
mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
S2 = mel.dot(magnitude)
assert np.allclose(S2, S)  # floating point: compare with a tolerance, not ==

# no extra np.log10 needed: power_to_db below already applies the log (10 * log10)

mel_spect = librosa.power_to_db(S, ref=np.max)
plt.imshow(mel_spect, origin="lower", cmap=plt.get_cmap("magma"))

plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.show()

48.6. distinguish emotions

# X is the loaded audio time series, sample_rate from librosa.load
male = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13)
male = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13), axis=0)

49. Audio

49.1. terms

  • down-mixing - The process of combining multiple audio output channels into a single stereo or mono output
  • resampling - changing the sample rate, i.e. samples per second

49.2. theory

  • waveform - a wave or oscillating curve with amplitude
  • frequency - occurrences of vibrations per unit of time
  • sampling frequency or sampling rate - the average number of samples obtained in one second, in hertz; e.g. 48 kHz is 48,000 samples per second, 44.1 kHz is 44,100 samples per second
  • Bit depth - typically recorded at 8-, 16-, and 24-bit depth
    • mp3 does not have bit depth - compressed format
    • wav - uncompressed
  • quality: 44.1kHz / 16-bit - CD; 192kHz / 24-bit - hi-res audio
  • bit rate - bits per second required for encoding without compression

Calc bit rate and size:

  • 44.1kHz/16-bit: 44,100 x 16 x 2 = 1,411,200 bits per second (1.4Mbps)
  • 44.1kHz/16-bit: 1.4Mbps * 300s = 420Mb (52.5MB)
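The arithmetic above, as a pair of helper functions (pure Python; function names are mine):

```python
def pcm_bit_rate(sample_rate, bit_depth, channels):
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate * bit_depth * channels

def pcm_size_bytes(sample_rate, bit_depth, channels, seconds):
    """Uncompressed file size in bytes (8 bits per byte)."""
    return pcm_bit_rate(sample_rate, bit_depth, channels) * seconds // 8

print(pcm_bit_rate(44100, 16, 2))         # 1411200 bits/s (~1.4 Mbps)
print(pcm_size_bytes(44100, 16, 2, 300))  # 52920000 bytes (~52.9 MB)
```

The exact 300-second size is 52.92 MB; the 52.5 MB figure above comes from rounding the bit rate down to 1.4 Mbps first.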

All wave forms

  • periodic
    • simple
    • complex
  • aperiodic
    • noise
    • pulse
  • amplitude - distance from max and min
  • wavelength - total distance covered by a particle in one time period
  • Phase - the location of the wave relative to an equilibrium point at time t=0

features

  • loudness - the brain's perception of intensity
  • pitch - the brain's perception of frequency
  • quality or Timbre - the brain's perception of ?
  • intensity
  • amplitude phase
  • angular velocity

49.3. The Fourier Transform (spectrum)

a mathematical transform that converts the signal from the time domain into the frequency domain.

  • result - *spectrum
  • Fourier’s theorem - signal can be decomposed into a set of sine and cosine waves
  • fast Fourier transform (FFT) is an algorithm that can efficiently compute the Fourier transform
  • Short-time Fourier transform - signal in the time-frequency domain by computing discrete Fourier transforms (DFT) over short overlapping windows. for non periodic signals - such as music and speech

49.4. log-Mel spectrogram

spectrogram - the horizontal axis represents time, the vertical axis represents frequency, and the color intensity represents the amplitude of a frequency at a certain point.

  • y - Decibels
  • used to train convolutional neural networks for the classification

A Mel-spectrogram converts the frequencies to the mel scale, which is "a perceptual scale of pitches judged by listeners to be equal in distance from one another"

  • y - just Hz 0,64,128,256,512,1024
  • It uses the Mel Scale instead of Frequency on the y-axis.
  • It uses the Decibel Scale instead of Amplitude to indicate colors.
  • x - time sequence
  • value - mel shaped dB

Mel scale (after the word melody) - frequency(Hz) to mels(mel) conversion by formula

  • the pair at 100Hz and 200Hz will sound further apart than the pair at 1000Hz and 1100Hz.
  • you will hardly be able to distinguish between the pair at 10000Hz and 10100Hz.

Decibel Scale - logarithmic

  • 10 dB is 10 times louder than 0 dB
  • 20 dB is 100 times louder than 0 dB (i.e. 10 times louder than 10 dB)
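A quick check of the scale (stdlib only; function name is mine):

```python
def db_to_power_ratio(db):
    """Decibels measure power ratios on a log10 scale: every +10 dB is a 10x ratio."""
    return 10 ** (db / 10)

print(db_to_power_ratio(10))  # 10.0  -> 10 dB is 10 times louder than 0 dB
print(db_to_power_ratio(20))  # 100.0 -> 20 dB is 100 times louder than 0 dB
```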

steps:

  1. Separate to windows: Sample the input with windows of size n_fft=2048, making hops of size hop_length=512 each time to sample the next window.
  2. Compute FFT (Fast Fourier Transform) for each window to transform from time domain to frequency domain.
  3. Generate a Mel scale: Take the entire frequency spectrum, and separate it into n_mels=128 evenly spaced frequencies.
  4. Generate Spectrogram: For each window, decompose the magnitude of the signal into its components, corresponding to the frequencies in the mel scale.

49.4.1. Log - because

  • np.log10(S) after the mel spectrogram
  • or because the Mel Scale itself has a log in its formula:

func frequencyToMel(_ frequency: Float) -> Float {
    return 2595 * log10(1 + (frequency / 700))
}

func melToFrequency(_ mel: Float) -> Float {
    return 700 * (pow(10, mel / 2595) - 1)
}

49.5. pyo

49.6. torchaudio

50. Whisper

  • a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model
  • Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder
  • automatic speech recognition (ASR)
  • Whisper is pre-trained on a vast quantity of labelled audio-transcription data, 680,000 hours to be precise
  • 117,000 hours of this pre-training data is multilingual ASR data
  • supervised task of speech recognition
  • uses

logits - scores over all 51865 tokens of the vocabulary

Steps:

  1. model.transcribe
  2. model.decode
  3. DecodingTask.run()
  4. self._main_loop

50.1. Byte-Pair Encoding (BPE)

Tokenization algorithms can be

  • word
  • subword - used by most state-of-the-art NLP models - frequently used words should not be split into smaller subwords
  • character-based

Subword-based tokenization:

50.1.1. usage

from transformers import GPT2TokenizerFast
path = '/home/u2/.local/lib/python3.8/site-packages/whisper/assets/multilingual'

tokenizer = GPT2TokenizerFast.from_pretrained(path)

tokens = [[50364, 3450, 5505, 13, 50464, 51014, 9149, 11, 6035, 5345, 7520, 1595, 6885, 1725, 30162, 13, 51114, 51414, 21249, 7520, 9916, 13, 51464]]
print([tokenizer.decode(t).strip() for t in tokens])
print(tokenizer.encode('А вот. Да, но он уже у меня не работает. Нет уже нет.'))

50.2. model.transcribe(filepath or numpy)

  • mel = log_mel_spectrogram(audio) # split audio by chunks (84)
    • whisper.audio.load_audio(filepath)
  • if no language is set - it will use the first 30 seconds to detect the language
  • loop seek<length
    • get 3000 frames - 30 seconds
    • decode segment - DecodingResult=DecodingTask(model, options).run(mel) decoding.py (701) see 50.3
    • if no speech then skip
    • split the segment into consecutive sub-segments
  • tokenize and segment
  • summarize

  • segments - a chunk of speech you obtain from the timestamps. Something like 10.00s -> 13.52s would be a segment

50.2.1. return

  • text - full text
  • segments
    • seek
    • start&end
    • text - segment text
    • 'tokens': []
    • 'temperature': 0.0,
    • 'avg_logprob': -0.7076873779296875, # if < -1 - too low probability, retranscribe with another temperature
    • 'compression_ratio': 1.1604938271604939,
    • 'no_speech_prob': 0.5063244700431824 - if greater than 0.6, the segment is not returned
  • 'language': 'ru'

{'text': 'long text', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 2.64, 'text': ' А вот, не добрый день.', 'tokens': [], 'temperature': 0.0, 'avg_logprob': -0.7076873779296875, 'compression_ratio': 1.1604938271604939, 'no_speech_prob': 0.5063244700431824}, {'id': 1, 'seek': 0, 'start': 2.64, 'end': 4.64, 'text': ' Меня зовут Дмитрий, это Русснорбанг.', 'tokens': [], 'temperature': 0.0, 'avg_logprob': -0.7076873779296875, 'compression_ratio': 1.1604938271604939, 'no_speech_prob': 0.5063244700431824}, {'id': 2, 'seek': 0, 'start': 4.64, 'end': 8.040000000000001, 'text': ' Дайте, он разжонили по поводу Мехеэлы Романовича Гапуэк,', 'tokens': [], 'temperature': 0.0, 'avg_logprob': -0.7076873779296875, 'compression_ratio': 1.1604938271604939, 'no_speech_prob': 0.5063244700431824},

{'id': 62, 'seek': 13828, 'start': 150.28, 'end': 151.28, 'text': ' Если…', 'tokens': [], 'temperature': 0.0, 'avg_logprob': -0.3628227009492762, 'compression_ratio': 1.0274509803921568, 'no_speech_prob': 1.6432641132269055e-05}, {'id': 63, 'seek': 13828, 'start': 151.28, 'end': 154.28, 'text': ' Если как-то пежись, хорошо, накрыли.', 'tokens': [], 'temperature': 0.0, 'avg_logprob': -0.3628227009492762, 'compression_ratio': 1.0274509803921568, 'no_speech_prob': 1.6432641132269055e-05}, {'id': 64, 'seek': 15428, 'start': 154.28, 'end': 183.28, 'text': ' Ну, да, всего доброго, до сих пор.', 'tokens': [50364, 7571, 11, 8995, 11, 15520, 35620, 2350, 11, 5865, 776, 4165, 11948, 13, 51814], 'temperature': 0.0, 'avg_logprob': -0.9855107069015503, 'compression_ratio': 0.576271186440678, 'no_speech_prob': 6.223811215022579e-05}], 'language': 'ru'}

50.3. model.decode(mel, options)

options: language

DecodingTask(model, options).run(mel)

  • create GPT2TokenizerFast wrapped
  • audio_features <- mel
  • tokens, sum_logprobs, no_speech_probs <- audio_features
  • texts: List[str] = [tokenizer.decode(t).strip() for t in tokens]
    • tokens = [ [50364, 3450, 5505, 13, 50464, 51014, 9149, 11, 6035, 5345, 7520, 1595, 6885, 1725, 30162, 13, 51114, 51414, 21249, 7520, 9916, 13, 51464] ]
  • <- fine tune

https://huggingface.co/blog/fine-tune-whisper https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz

50.4. no_speech_prob and avg_logprob

  • no_speech_prob - computed only at the first token, from the logits at the SOT position
  • avg_logprob
    • sum_logprobs - sum of:
      • current_logprobs - logprobs = F.log_softmax(logits.float(), dim=-1)

50.5. decode from whisper_word_level 844

decode_word_level 781

  • result, ts = decode.run() 711 - decoding.py 612
  • finalize 524 - decoding.py 271

self.ts

  • self.decoder.update_with_ts 700 (main_loop) - decoding.py 602

50.6. main_loop

receive

  • audio_features
  • tokens with 3 values

  • tokens - extended by one token each step
  • completed: bool = False
  • sum_logprobs: float

50.7. word timestamps https://github.com/jianfch/stable-ts

timestamp_logits - ts_logits - self.ts -

50.7.1. transcribe format

  • segments:

[{'id': 0, 'seek': 0, 'offset': 0.0, 'start': 1.0, 'end': 3.0, 'text': ' А вот, не добрый день.', 'tokens': [50414, 3450, 5505, 11, 1725, 35620, 4851, 13509, 13, 50514, 50514, 47311, 46376, 3401, 919, 1635, 50161, 11, 2691, 6325, 7071, 461, 1234, 481, 1552, 1416, 1906, 13, 50564, 50564, 3401, 10330, 11, 5345, 4203, 1820, 1784, 5435, 2801, 10499, 35749, 50150, 386, 2338, 6325, 1253, 11114, 3903, 386, 7247, 4219, 23412, 3605, 13, 50714, 50714, 3200, 585, 37408, 585, 11, 2143, 10655, 30162, 1006, 17724, 15028, 4558, 13, 50814, 50814, 2348, 1069, 755, 12886, 387, 29868, 11, 776, 31158, 50233, 19411, 23201, 860, 1283, 25190, 13, 51014, 51014, 9149, 11, 6035, 5345, 7520, 1595, 6885, 1725, 30162, 13, 51064, 51064, 3450, 5505, 5865, 10751, 29117, 21235, 13640, 11, 2143, 5345, 1595, 10655, 2801, 7247, 9223, 24665, 30162, 13, 51314, 51314, 6684, 1725, 13790, 13549, 10986, 11, 6035, 8995, 11, 6035, 4777, 1725, 485, 51414, 51414, 21249, 7520, 9916, 13, 51464, 51464, 4857, 37975, 11, 25969, 5878, 11, 3014, 50150, 386, 2338, 6325, 1253, 11114, 3903, 1595, 6519, 3348, 35968, 23412, 34005, 47573, 51664, 51664, 10969, 45309, 13388, 19465, 5332, 4396, 20392, 44356, 740, 1069, 755, 1234, 1814, 13254, 11, 51814, 51814], 'temperature': 0.0, 'avg_logprob': -0.5410955043438354, 'compression_ratio': 1.1496259351620948, 'no_speech_prob': 0.5069490671157837, 'alt_start_timestamps': [1.0, 0.9199999570846558, 1.0399999618530273, 0.9599999785423279, 1.100000023841858, 0.9399999976158142, 0.9799999594688416, 1.0799999237060547, 1.1200000047683716, 1.1999999284744263], 'start_ts_logits': [13.0390625, 12.4140625, 12.296875, 12.2109375, 12.171875, 12.140625, 12.0390625, 11.9921875, 11.9453125, 11.8046875], 'alt_end_timestamps': [3.0, 2.0, 2.859999895095825, 2.879999876022339, 2.8999998569488525, 4.0, 2.9800000190734863, 3.0399999618530273, 2.299999952316284, 2.359999895095825], 'end_ts_logits': [9.6015625, 8.9375, 7.65234375, 7.53125, 7.4609375, 7.4609375, 7.30859375, 7.28515625, 7.22265625, 
7.11328125], 'unstable_word_timestamps': [{'word': ' А', 'token': 3450, 'timestamps':[7.0, 29.5, 1.0, 29.35999870300293, 13.0, 29.279998779296875, 29.34000015258789, 29.479999542236328, 28.939998626708984, 29.01999855041504], 'timestamp_logits': [15.1328125, 15.0703125, 14.9921875, 14.96875, 14.96875, 14.96875, 14.890625, 14.8359375, 14.7890625, 14.7890625]}, {'word': ' вот', 'token': 5505, 'timestamps': [27.34000015258789, 29.31999969482422, 26.979999542236328, 28.420000076293945, 28.739999771118164, 27.31999969482422, 28.439998626708984, 29.34000015258789, 13.519999504089355, 28.239999771118164], 'timestamp_logits': [19.546875, 19.46875, 19.296875, 19.125, 19.109375, 19.109375, 19.09375, 19.09375, 19.078125, 19.046875]}, {'word': ',', 'token': 11, 'timestamps': [2.0, 3.0, 4.0, 1.0, 1.7999999523162842, 10.0, 3.0199999809265137, 1.7599999904632568, 19.0, 3.5], 'timestamp_logits': [14.8828125, 13.640625, 13.21875, 12.734375, 11.3828125, 11.3671875, 11.3515625, 11.3359375, 11.2890625, 11.2578125]}, {'word': ' не', 'token': 1725, 'timestamps': [2.0, 1.0, 1.7599999904632568, 1.71999990940094, 1.6399999856948853, 1.7799999713897705, 28.19999885559082, 1.7999999523162842, 7.0, 28.239999771118164], 'timestamp_logits': [15.328125, 15.03125, 14.921875, 14.4453125, 14.3671875, 14.234375, 14.2265625, 14.203125, 14.0234375, 13.875]}, {'word': ' добр', 'token': 35620, 'timestamps': [28.099998474121094, 28.139999389648438, 14.75999927520752, 14.920000076293945, 27.099998474121094, 18.119998931884766, 14.59999942779541, 28.260000228881836, 13.0, 26.599998474121094], 'timestamp_logits': [14.015625, 13.9765625, 13.96875, 13.8515625, 13.84375, 13.8046875, 13.7109375, 13.7109375, 13.6953125, 13.6953125]}, {'word': 'ый', 'token': 4851, 'timestamps': [13.59999942779541, 15.399999618530273, 13.279999732971191, 14.719999313354492, 13.399999618530273, 14.880000114440918, 13.0, 14.59999942779541, 13.679999351501465, 13.639999389648438], 'timestamp_logits': [15.4140625, 15.28125, 15.21875, 
14.765625, 14.7265625, 14.71875, 14.6328125, 14.578125, 14.5546875, 14.53125]}, {'word': ' день', 'token': 13509, 'timestamps': [2.0, 20.959999084472656, 3.0, 25.68000030517578, 3.4800000190734863, 24.0, 3.5, 19.920000076293945, 28.559999465942383, 4.0], 'timestamp_logits': [9.3984375, 9.21875, 9.046875, 9.015625, 8.9296875, 8.90625, 8.875, 8.8203125, 8.7890625, 8.7421875]}, {'word': '.', 'token': 13, 'timestamps': [3.0, 2.0, 4.0, 3.5, 3.0199999809265137, 2.879999876022339, 3.319999933242798, 3.0399999618530273, 2.299999952316284, 2.859999895095825], 'timestamp_logits': [12.6328125, 12.4296875, 10.875, 10.2578125, 9.828125, 9.5078125, 9.4921875, 9.421875, 9.3828125, 9.3046875]} ], 'anchor_point': False, 'word_timestamps': [{'word': ' А', 'token': 3450, 'timestamp': 1.0}, {'word': ' вот', 'token': 5505, 'timestamp': 1.0}, {'word': ',', 'token': 11, 'timestamp': 2.0}, {'word': ' не', 'token': 1725, 'timestamp': 2.0}, {'word': ' добр', 'token': 35620, 'timestamp': 2.0}, {'word': 'ый', 'token': 4851, 'timestamp': 2.0}, {'word': ' день', 'token': 13509, 'timestamp': 2.0}, {'word': '.', 'token': 13, 'timestamp': 3.0}], 'whole_word_timestamps': [{'word': ' А', 'timestamp': 1.3799999952316284}, {'word': ' вот,', 'timestamp': 1.7599999904632568}, {'word': ' не', 'timestamp': 1.7899999618530273}, {'word': ' добр', 'timestamp': 1.8899999856948853}, {'word': 'ый', 'timestamp': 1.8899999856948853}, {'word': ' день.', 'timestamp': 2.5899999141693115} ] }, {'id': 1,

50.8. confidence score

sum_logprobs: List[float] = [lp[i] for i, lp in zip(selected, sum_logprobs)]

avg_logprob - [lp / (len(t) + 1) for t, lp in zip(tokens, sum_logprobs)]
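The two lines above can be reproduced with a small NumPy sketch (the logits are hypothetical; the `+ 1` accounts for the end-of-text token):

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax, same idea as F.log_softmax
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())

# hypothetical per-step logits over a tiny 5-token vocabulary
logits = np.array([[2.0, 0.1, 0.3, 0.0, 0.5],
                   [0.2, 3.0, 0.1, 0.4, 0.0]])
tokens = [0, 1]  # the tokens actually chosen at each step

# sum of the chosen tokens' log-probabilities...
sum_logprobs = sum(log_softmax(step)[t] for step, t in zip(logits, tokens))
# ...normalised by sequence length (+1 for EOT), as in avg_logprob
avg_logprob = sum_logprobs / (len(tokens) + 1)
print(avg_logprob)
```

An avg_logprob near 0 means confident decoding; below about -1 the transcription is retried with a different temperature.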

path

  • model.transcribe
  • model.decode
  • transcribe_word_level (whisper_word_level.py:39)
  • results, ts_tokens, ts_logits_ = model.decode

50.9. TODO main/notebooks

51. NER USE CASES

51.1. Spelling correction algorithms (spell checker; comparing a word to a list of words)

Damerau-Levenshtein - edit distance; with a precomputed deletion dictionary (the SymSpell approach) lookup is constant time O(1), independent of the word-list size (but depending on the average term length and maximum edit distance)
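For reference, a straightforward O(m*n) Damerau-Levenshtein (restricted / optimal-string-alignment) implementation - the constant-time trick above comes from precomputing deletes, not from this algorithm:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein('ca', 'ac'))           # 1 - a single transposition
print(damerau_levenshtein('kitten', 'sitting'))  # 3
```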

51.2. fuzzy string comparison (approximate string matching)

approaches:

  • Levenshtein is O(m*n), where m and n are the lengths of the two input strings
  • difflib.SequenceMatcher
    • uses the Ratcliff/Obershelp algorithm - O(n²) in the worst case
  • Hamming distance - does not handle insertions or deletions; for two strings of equal length it only counts the positions where the symbols differ
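Quick examples of the last two approaches using only the standard library:

```python
from difflib import SequenceMatcher

# Ratcliff/Obershelp similarity ratio in [0, 1]; here the strings differ in one letter
ratio = SequenceMatcher(None, 'приближенный', 'приближённый').ratio()
print(round(ratio, 3))

def hamming(s1: str, s2: str) -> int:
    """Number of differing positions; only defined for equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError('strings must have equal length')
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming('karolin', 'kathrin'))  # 3
```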

databases

52. Flax and Jax

Google

Flax - neural network library and ecosystem for JAX designed for flexibility

53. hyperparameter optimization library test-tube

54. Keras

MIT-licensed neural network library

  • a high-level wrapper over the Deeplearning4j, TensorFlow and Theano frameworks
  • aimed at fast experimentation with deep-learning networks
  • compact, modular and extensible
  • a higher-level, more intuitive set of abstractions that makes it simple to build neural networks
  • channels_last - default for keras python-ds#MissingReference

import logging
logging.getLogger('tensorflow').disabled = True

  • loss - loss function https://github.com/keras-team/keras/blob/c2e36f369b411ad1d0a40ac096fe35f73b9dffd3/keras/metrics.py
    • mean_squared_error
    • categorical_crossentropy
    • binary_crossentropy
    • sparse_categorical_accuracy - same as categorical accuracy, but for integer (non-one-hot) targets
    • top_k_categorical_accuracy - Calculates the top-k categorical accuracy rate, i.e. success when the target class is within the top-k predictions provided.
    • sparse_top_k_categorical_accuracy

Steps:

# 1. declare keras.layers.Input and keras.layers.Dense in a chain
# 2.
model = Model(inputs=inputs, outputs=predictions) # inputs - the Input layer, predictions - the last Dense layer
# 3. configure the model for training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# 4.
model.fit(data, labels, epochs=10, batch_size=32)
# 5. predict on one sample of shape (3,)
model.predict(np.array([[3, 3, 3]]))

model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

54.1. install

pip install keras --user

54.2. API types

  • Model subclassing: from keras.models import Model
  • Model constructor - deprecated
  • Functional API
  • Sequential model

54.3. Sequential model

  • first layer needs to receive information about its input shape - following layers can do automatic shape inference

54.4. functional API

54.5. Layers

  • layer.get_weights()
  • layer.get_config(): returns a dictionary containing the configuration of the layer.

54.5.1. types

  • Input - instantiates a Keras tensor; Input(shape=(784,)) indicates that the expected input will be batches of 784-dimensional vectors
  • Dense - each neuron receives input from all the neurons in the previous layer
  • Embedding - can only be used as the first layer
  • Merge layers - Concatenate, Add, Subtract, Multiply, Average, etc.

54.5.2. Dense

  • output = activation(dot(input, kernel) + bias)

54.6. Models

attributes:

  • model.layers is a flattened list of the layers comprising the model.
  • model.inputs is the list of input tensors of the model.
  • model.outputs is the list of output tensors of the model.
  • model.summary() prints a summary representation of your model. Shortcut for
  • model.get_config() returns a dictionary containing the configuration of the model.

54.7. Accuracy:

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.99794011611938471

# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98999999999999999

54.8. input shape & text prepare

import numpy as np
data = np.random.random((2, 3)) # 2x3 ndarray of random floats
print(data.shape) # (2, 3)

data = np.random.random((2,)) # e.g. [0.3907832  0.00941261], shape (2,)

list to ndarray

np.array(texts)
np.asarray(texts)

fit on batches

model.fit([np.asarray([x_embed , x_embed]) , np.asarray([x2_onehot, x2_onehot])], np.asarray([y_onehot[0], y_onehot[0]]), epochs=2, batch_size=2)

54.9. ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape

if

Input(shape=(5,100))

then

model.fit(x_embed, y_onehot, epochs=3, batch_size=1)

where x_embed.shape = (1, 5, 100)
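The mismatch usually means a missing batch dimension. A NumPy-only illustration (no Keras needed):

```python
import numpy as np

words, embedding_size = 5, 100
x = np.random.random((words, embedding_size))  # one sample, shape (5, 100)

# Input(shape=(5, 100)) expects batches, i.e. arrays of shape (batch, 5, 100)
x_batch = x[np.newaxis, ...]  # add the batch axis
print(x_batch.shape)  # (1, 5, 100)

# the same for several samples
x_many = np.stack([x, x])
print(x_many.shape)   # (2, 5, 100)
```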

54.10. merge inputs

https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/

Added one more Input(shape=(x2_size,)) as a vector and concatenated it with the flat neuron layer; it is important that the shapes have the same rank - in this case a vector

    inp = Input(shape=(words, embedding_size))  # 5 tokens
    output = inp
    # second input: a flat vector (used in the concatenate below)
    word_input = Input(shape=(x2_size,), name='word_input')

    outputs = []
    for i in range(len(kernel_sizes_cnn)):
        output_i = Conv1D(filters_cnn, kernel_size=kernel_sizes_cnn[i],
                          activation=None,
                          kernel_regularizer=l2(coef_reg_cnn),
                          padding='same')(output)
        output_i = BatchNormalization()(output_i)
        output_i = Activation('relu')(output_i)
        output_i = GlobalMaxPooling1D()(output_i)
        outputs.append(output_i)

    output = concatenate(outputs, axis=1)
    #my
    output = concatenate([output, word_input]) #second input

    output = Dropout(rate=dropout_rate)(output)
    output = Dense(dense_size, activation=None,
                   kernel_regularizer=l2(coef_reg_den))(output)

    output = BatchNormalization()(output)
    output = Activation('relu')(output)
    output = Dropout(rate=dropout_rate)(output)
    output = Dense(n_classes, activation=None,
                   kernel_regularizer=l2(coef_reg_den))(output)
    output = BatchNormalization()(output)
    act_output = Activation("softmax")(output)
    model = Model(inputs=[inp, word_input], outputs=act_output)

model: Model = build_model(vocab_y.len, embedder.dim, words, embedder.dim)
model.fit([np.asarray(x), np.asarray(x2)], np.asarray(y), epochs=100, batch_size=2)

54.11. convolution

  • filters - dimensionality of the output space. In practice they number 64, 128, 256, 512 etc.
  • kernel_size is the size of these convolution filters - the sliding window. In practice they are 3x3, 1x1 or 5x5
  • Note that the number of filters in the previous layer becomes the number of channels of the current layer's input.
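The channel bookkeeping can be checked with plain shape arithmetic (a hypothetical helper, `conv1d_shapes`, showing only 'valid'-padding shapes - real layers also add padding, strides and bias):

```python
def conv1d_shapes(seq_len, in_channels, filters, kernel_size):
    """Output length for 'valid' padding and the new channel count.

    in_channels is summed over by the convolution, so it does not
    appear in the output shape - filters from this layer become the
    channels of the next layer's input.
    """
    out_len = seq_len - kernel_size + 1
    return (out_len, filters)

x_shape = (100, 3)  # 100 steps, 3 input channels
l1 = conv1d_shapes(*x_shape, filters=64, kernel_size=3)
print(l1)  # (98, 64)
l2 = conv1d_shapes(*l1, filters=128, kernel_size=3)
print(l2)  # (96, 128)
```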

54.13. Early stopping

https://keras.io/callbacks/

from tensorflow.keras.callbacks import EarlyStopping
early_stopping_callback = EarlyStopping(monitor='val_acc', patience=2)
model.fit(X_train, Y_train, callbacks=[early_stopping_callback])
from keras.callbacks import EarlyStopping
# ...
num_epochs = 50 # we iterate at most fifty times over the entire training set
# ...
# fit the model on the batches generated by datagen.flow()---most parameters similar to model.fit
model.fit_generator(datagen.flow(X_train, Y_train,
                        batch_size=batch_size),
                        samples_per_epoch=X_train.shape[0],
                        nb_epoch=num_epochs,
                        validation_data=(X_val, Y_val),
                        verbose=1,
                        callbacks=[EarlyStopping(monitor='val_loss', patience=5)]) # adding early stopping

54.14. plot history

history = model.fit(X_train, Y_train, validation_split=0.2)
plt.plot(history.history['acc'],
         label='Accuracy on the training set')
plt.plot(history.history['val_acc'],
         label='Accuracy on the validation set')
plt.xlabel('Training epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

54.15. ImageDataGenerator class

datagen = ImageDataGenerator(
#         zoom_range=0.2, # randomly zoom into images
#         rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

54.16. CNN Rotate

54.17. LSTM

https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/ By default the Keras implementation resets the network state after each training batch.

model.add(LSTM(50, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
model.reset_states() # at the end of epoch

55. Tesseract - Optical Character Recognition

55.1. compilation

dockerfile:

RUN apt-get update && apt-get install -y --no-install-recommends \
  g++ \
  automake \
  make \
  libtool \
  pkg-config \
  libleptonica-dev \
  curl \
  libpng-dev \
  zlib1g-dev \
  libjpeg-dev \
  && apt-get autoclean \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

ARG PREFIX=/usr/local
ARG VERSION=4.1.0

RUN curl --silent --location --location-trusted \
        --remote-name https://github.com/tesseract-ocr/tesseract/archive/$VERSION.tar.gz \
  && tar -xzf $VERSION.tar.gz \
  && cd tesseract-$VERSION \
  && ./autogen.sh \
  && ./configure --prefix=$PREFIX \
  && make \
  && make install \
  && ldconfig

55.2. black and white list

https://github.com/tesseract-ocr/langdata/blob/master/rus/rus.training_text

  • ./tesseract -l eng /home/u2/Documents/2.jpg stdout -c tessedit_char_blacklist='0123456789'
  • ./tesseract -l eng /home/u2/Documents/2.jpg stdout -c tessedit_char_whitelist='0123456789'

print(pytesseract.image_to_string(im, lang='rus', config='-c tessedit_char_whitelist=0123456789'))

55.3. notes

when we repeat a symbol (in the training text), it starts to recognize it

55.4. prepare

55.5. usage

text = pytesseract.image_to_string(img, lang='rus')

letters = pytesseract.image_to_boxes(img, lang='rus')  # one "char x1 y1 x2 y2 page" line per letter
letters = [letter.split() for letter in letters.split('\n') if letter]
h, w = img.shape  # grayscale image

for letter in letters:
    # image_to_boxes uses a bottom-left origin, OpenCV a top-left one, so flip y
    p_x = int(letter[1])
    p_y = h - int(letter[2])   # lower edge in image coordinates
    p_x2 = int(letter[3])
    p_y2 = h - int(letter[4])  # upper edge - close to 0, y2 < y

    cv.rectangle(img, (p_x, p_y), (p_x2, p_y2), (0, 0, 255), 2)

    # the four corners of the box
    cc = [
        [p_x, p_y],
        [p_x2, p_y],   # _
        [p_x2, p_y2],  # _|
        [p_x, p_y2]]
    c = np.array(cc, dtype=np.int32)
    # print(cv.contourArea(c), ',')

    # the same box as x, y, width, height with y at the top edge
    box = [p_x, p_y2, p_x2 - p_x, p_y - p_y2]

56. FEATURE ENGINEERING

56.1. Featuretools - Automatic Feature Engineering

Limitation: intended to be run on datasets that can fit in memory on one machine

  • split the load into row chunks and build an array
  • load a subset by date

Steps:

  1. create dict {column: [rows], column2: [rows]}
  2. EntitySet
    • Entities - pd.DataFrame
    • Relations
      • one-to-one / one-to-many only - for many-to-many you must create an intermediate set of ids
      • for each child id the parent id MUST EXIST
      • child id and parent id types must be equal
  3. ft.dfs - input: entities with relationships

Cons

  • junk columns built on id columns, in the order from child to parent, with many-to-many relations

for prediction you should have roughly 10x more rows than features https://www.youtube.com/watch?v=Dc0sr0kdBVI&hd=1#t=57m20s

56.1.1. variable types

56.1.2. example one-to-many

# sys.partner_id - foreign key
# partner - one
# sys - many
entities = {
  "sys": (sys, "id"),
  "partner": (partner, "id)
}
relationships = {
  ("partner", "id", "sys", "partner_id")
}
# fields:
# partner.SUM(sys.field1)


56.1.3. example many-to-many

entities = {
  "sys": (sys, "id"),
  "cl_ids": (cl_ids, "id"),
  "cl_budget": (cl_budget, "idp")
}
relationships = {
  ("cl_ids", "id", "sys", "client_id"),
  ("cl_ids", "id", "cl_budget", "id")
}

# cl_ids.SUM(cl_budget.field1)
# cl_ids.SUM(sys.field1) - junk field duplicating sys.field1

56.1.4. operations

ft.list_primitives().head(5)

56.1.5. aggregation primitive - across a parent-child relationship:

Default: [“sum”, “std”, “max”, “skew”, “min”, “mean”, “count”, “percent_true”, “num_unique”, “mode”]

skew
Computes the extent to which a distribution differs from a normal distribution.
std
Computes the dispersion relative to the mean value, ignoring `NaN`.
percent_true
Determines the percent of `True` values.
mode
Determines the most commonly repeated value.
  1. all
    • 0 std aggregation Computes the dispersion relative to the mean value, ignoring `NaN`.
    • 1 median aggregation Determines the middlemost number in a list of values.
    • 2 n_most_common aggregation Determines the `n` most common elements.
    • 3 num_true aggregation Counts the number of `True` values.
    • 4 time_since_last aggregation Calculates the time elapsed since the last datetime (default in seconds).
    • 5 max aggregation Calculates the highest value, ignoring `NaN` values.
    • 6 entropy aggregation Calculates the entropy for a categorical variable
    • 7 any aggregation Determines if any value is 'True' in a list.
    • 8 mode aggregation Determines the most commonly repeated value.
    • 9 time_since_first aggregation Calculates the time elapsed since the first datetime (in seconds).
    • 10 trend aggregation Calculates the trend of a variable over time.
    • 11 first aggregation Determines the first value in a list.
    • 12 sum aggregation Calculates the total addition, ignoring `NaN`.
    • 13 count aggregation Determines the total number of values, excluding `NaN`.
    • 14 skew aggregation Computes the extent to which a distribution differs from a normal distribution.
    • 15 avg_time_between aggregation Computes the average number of seconds between consecutive events.
    • 16 percent_true aggregation Determines the percent of `True` values.
    • 17 num_unique aggregation Determines the number of distinct values, ignoring `NaN` values.
    • 18 all aggregation Calculates if all values are 'True' in a list.
    • 19 min aggregation Calculates the smallest value, ignoring `NaN` values.
    • 20 last aggregation Determines the last value in a list.
    • 21 mean aggregation Computes the average for a list of values.

56.1.6. TransformPrimitive - one or more variables from an entity to one new:

Default: [“day”, “year”, “month”, “weekday”, “haversine”, “num_words”, “num_characters”]

Useful:

  • divide_numeric - ratio

Transform Don't have:

  • root
  • square_root
  • log
  1. all
    • https://docs.featuretools.com/en/stable/_modules/featuretools/primitives/standard/binary_transform.html
    • 22 year transform Determines the year value of a datetime.
    • 23 equal transform Determines if values in one list are equal to another list.
    • 24 isin transform Determines whether a value is present in a provided list.
    • 25 num_characters transform Calculates the number of characters in a string.
    • 26 less_than_scalar transform Determines if values are less than a given scalar.
    • 27 less_than_equal_to transform Determines if values in one list are less than or equal to another list.
    • 28 multiply_boolean transform Element-wise multiplication of two lists of boolean values.
    • 29 week transform Determines the week of the year from a datetime.
    • 30 greater_than_equal_to_scalar transform Determines if values are greater than or equal to a given scalar.
    • 31 and transform Element-wise logical AND of two lists.
    • 32 multiply_numeric transform Element-wise multiplication of two lists.
    • 33 second transform Determines the seconds value of a datetime.
    • 34 not_equal transform Determines if values in one list are not equal to another list.
    • 35 day transform Determines the day of the month from a datetime.
    • 36 cum_min transform Calculates the cumulative minimum.
    • 37 greater_than_scalar transform Determines if values are greater than a given scalar.
    • 38 modulo_numeric_scalar transform Return the modulo of each element in the list by a scalar.
    • 39 subtract_numeric_scalar transform Subtract a scalar from each element in the list.
    • 40 absolute transform Computes the absolute value of a number.
    • 41 add_numeric_scalar transform Add a scalar to each value in the list.
    • 42 cum_count transform Calculates the cumulative count.
    • 43 divide_by_feature transform Divide a scalar by each value in the list.
    • 44 divide_numeric_scalar transform Divide each element in the list by a scalar.
    • 45 time_since_previous transform Compute the time since the previous entry in a list.
    • 46 longitude transform Returns the second tuple value in a list of LatLong tuples.
    • 47 cum_max transform Calculates the cumulative maximum.
    • 48 not transform Negates a boolean value.
    • 49 not_equal_scalar transform Determines if values in a list are not equal to a given scalar.
    • 50 diff transform Compute the difference between the value in a list and the
    • 51 equal_scalar transform Determines if values in a list are equal to a given scalar.
    • 52 num_words transform Determines the number of words in a string by counting the spaces.
    • 53 divide_numeric transform Element-wise division of two lists.
    • 54 less_than_equal_to_scalar transform Determines if values are less than or equal to a given scalar.
    • 55 month transform Determines the month value of a datetime.
    • 56 or transform Element-wise logical OR of two lists.
    • 57 weekday transform Determines the day of the week from a datetime.
    • 58 less_than transform Determines if values in one list are less than another list.
    • 59 minute transform Determines the minutes value of a datetime.
    • 60 multiply_numeric_scalar transform Multiply each element in the list by a scalar.
    • 61 greater_than_equal_to transform Determines if values in one list are greater than or equal to another list.
    • 62 hour transform Determines the hour value of a datetime.
    • 63 modulo_by_feature transform Return the modulo of a scalar by each element in the list.
    • 64 scalar_subtract_numeric_feature transform Subtract each value in the list from a given scalar.
    • 65 is_weekend transform Determines if a date falls on a weekend.
    • 66 greater_than transform Determines if values in one list are greater than another list.
    • 67 cum_mean transform Calculates the cumulative mean.
    • 68 modulo_numeric transform Element-wise modulo of two lists.
    • 69 subtract_numeric transform Element-wise subtraction of two lists.
    • 70 haversine transform Calculates the approximate haversine distance between two LatLong
    • 71 is_null transform Determines if a value is null.
    • 72 add_numeric transform Element-wise addition of two lists.
    • 73 cum_sum transform Calculates the cumulative sum.
    • 74 percentile transform Determines the percentile rank for each value in a list.
    • 75 time_since transform Calculates time from a value to a specified cutoff datetime.
    • 76 latitude transform Returns the first tuple value in a list of LatLong tuples.
    • 77 negate transform Negates a numeric value.

56.1.7. create primitive

from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Numeric
import numpy as np

# Create two new functions for our two new primitives
def Log(column):
    return np.log(column)

def Square_Root(column):
    return np.sqrt(column)
# Create the primitives
log_prim = make_trans_primitive(
    function=Log, input_types=[Numeric], return_type=Numeric)
square_root_prim = make_trans_primitive(
    function=Square_Root, input_types=[Numeric], return_type=Numeric)

56.1.8. EXAMPLE from pandas

es = ft.EntitySet()
matches_df = pd.read_csv("./matches.csv")
es.entity_from_dataframe(entity_id="matches",
                         index="match_id",
                         time_index="match_date",
                         dataframe=matches_df)

56.4. TSFRESH (time series)

56.5. ATgfe - new feature

57. support libraries

dask - scale numpy, pandas, scikit-learn, XGBoost

tqdm - progress meter for loops: for i in tqdm(range(1000)):

msgpack - binary serialization, of JSON for example

cloudpickle - serialize lambdas and classes to "pickle"

tornado - non-blocking network I/O

BeautifulSoup - extract data from web html pages

58. Microsoft nni AutoML framework (stupid shut)

59. transformers - provides pretrained models

pip3 install transformers==4.24.0 --user
Collecting transformers==4.24.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 349.8 kB/s eta 0:00:00
Requirement already satisfied: tqdm>=4.27 in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (4.48.2)
Requirement already satisfied: packaging>=20.0 in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (22.0)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (0.12.1)
Requirement already satisfied: requests in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (2.28.1)
Requirement already satisfied: numpy>=1.17 in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (1.24.0)
Requirement already satisfied: filelock in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (3.0.12)
Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (0.10.0)
Requirement already satisfied: regex!=2019.12.17 in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (2022.9.13)
Requirement already satisfied: pyyaml>=5.1 in ./.local/lib/python3.8/site-packages (from transformers==4.24.0) (5.4.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in ./.local/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.10.0->transformers==4.24.0) (4.4.0)
Requirement already satisfied: idna<4,>=2.5 in ./.local/lib/python3.8/site-packages (from requests->transformers==4.24.0) (3.4)
Requirement already satisfied: charset-normalizer<3,>=2 in ./.local/lib/python3.8/site-packages (from requests->transformers==4.24.0) (2.1.1)
Requirement already satisfied: certifi>=2017.4.17 in ./.local/lib/python3.8/site-packages (from requests->transformers==4.24.0) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./.local/lib/python3.8/site-packages (from requests->transformers==4.24.0) (1.26.13)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.22.2
    Uninstalling transformers-4.22.2:
      Successfully uninstalled transformers-4.22.2
Successfully installed transformers-4.24.0

60. help

60.1. built-in help

  1. help(L.append) - docstring, signature and much more
  2. dir() or dir(object) - list of names in scope, or attributes of the object
  3. locals() - local variables and their values (called inside a method)
  4. globals() - all global variables and their values, as a dictionary
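A tiny runnable sketch of these introspection helpers (the function name is illustrative):

```python
def introspect(x=1):
    # locals(): names bound inside this function
    assert 'x' in locals()
    # dir(object): attribute names of an object
    assert 'append' in dir([])
    # globals(): module-level names, as a dict
    return 'introspect' in globals()

assert introspect() is True
# help(list.append)  # would print the docstring of list.append
```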

61. IDE

By default, Python source files are treated as UTF-8; to change the encoding:

#!/usr/bin/env python3
# -*- coding: cp1252 -*-

https://en.wikipedia.org/wiki/Comparison_of_integrated_development_environments#Python

61.1. REPL

py.exe or python.exe file [arg]

  • Exit - Control-D on Unix, Control-Z on Windows, or quit()
  • A blank line ends a multi-line command.

61.2. PyDev is a Python IDE for Eclipse

  • Ctrl+Space
  • F3 go to definition Alt+Arrow < >
  • Shift+Enter - next line
  • Ctrl+1 assign parameters to fields, create class constructor
  • Ctrl+2/R - rename variable
  • Alt+Shift+R rename variable
  • Alt+Shift+A Start/Stop Rectangular editing
  • Ctrl+F9 run test
  • Ctrl+F11 rerun last launch
  • Ctrl+Alt+Down/Up duplicate line
  • Alt+Shift+L Extract local variable
  • Alt+Shift+M Extract method

First:

  1. Create Project
  2. Create new Source Folder - "src" http://www.pydev.org/manual_101_project_conf2.html

61.2.1. features

  • Django integration
  • Code completion
  • Code completion with auto import
  • Type hinting
  • Code analysis
  • Go to definition
  • Refactoring
  • Debugger
  • Remote debugger
  • Find Referrers in Debugger
  • Tokens browser
  • Interactive console
  • Unittest integration
  • Code coverage
  • PyLint integration
  • Find References (Ctrl+Shift+G)

61.3. Emacs

M-~ menu

61.3.1. python in org mode

https://stackoverflow.com/questions/18598870/emacs-org-mode-executing-simple-python-code

C-c C-c - to activate

1+1
print(1+1)

.emacs configuration:

;; enable python for in-buffer evaluation
(org-babel-do-load-languages
 'org-babel-load-languages
 '((python . t)))

;; all python code be safe
(defun my-org-confirm-babel-evaluate (lang body)
(not (string= lang "python")))
(setq org-confirm-babel-evaluate 'my-org-confirm-babel-evaluate)

;; required
(setq shell-command-switch "-ic")

61.3.2. Emacs

https://habr.com/ru/post/303600/

.emacs.d/lisp


61.4. PyCharm

61.4.1. installation:

  • Other settings -> settings for new project -> Tools -> Python integrated tools -> docstrings - reStructuredText
  • Ctrl+Alt+S -> keymap - Emacs

navigate

  • Ctrl+Alt+S -> keymap - up -> Ctrl+k
  • Ctrl+Alt+S -> keymap - left -> Ctrl+l
  • Ctrl+Alt+S -> keymap - move caret to previous word -> Alt+l

other:

  • Ctrl+Alt+S -> keymap - Error Description -> add key Alt+Z
  • Ctrl+Alt+S -> keymap - Navigate; Back -> add key Ctrl+\
  • Ctrl+Alt+S -> keymap - Select next tab -> Alt+E
  • Ctrl+Alt+S -> keymap - Select previous tab -> Alt+A
  • Ctrl+Alt+S -> keymap - Close tab -> Ctrl+Alt+w
  • Ctrl+Alt+S -> keymap - Backspace -> Ctrl+h
  • Ctrl+Alt+S -> keymap - Delete to word start -> Alt+h
  • Ctrl+Alt+S -> keymap - run/ -> Ctrl+C Ctrl+C
  • Ctrl+Alt+S -> keymap - back (Navigate) -> Alt+,

Disable cursor blinking: Ctrl+Alt+s -> Editor, General, Appearance

61.4.2. keys

  • Alt+\ - main menu
  • Alt+Shift+F10 - run
  • Alt+Shift+F8 - debug
  • Ctrl+Shift+U to upper case
  • Ctrl+. fold/unfold
  • Ctrl+q get documentation
  • Ctrl+Alt+q auto-indent lines
  • Ctrl+z/v scroll
  • Alt+left/right switch tabs
  • Ctrl+x k close tab
  • Ctrl+x ` go to next error
  • Alt+. go to declaration
  • Ctr+Shift+' maximize bottom console

emacs keymap

  • Alt+Shift+F10 run
  • Alt+; - comment text
  • left Alt + arrows - switch tabs
  • left Alt+Enter - on yellow highlight - suggested fixes
  • Ctrl+Alt+L - Reformat code
  • Alt+Enter - at error - fix error menu
  • F10 - menu
  • Esc+Esc - focus Editor
  • F12 - focus last tool window(run)
  • Shift+Esc - hide low "Run"
  • Ctrl+ +/- - unfold/fold
  • Ctrl+m - Enter

navigate (Goto by reference actions)

  • Ctrl+Alt+g, Alt+. - navigate to definition
  • Alt+, - Navigate; Back (my)

Windows

  • Alt+1 - project navigation
  • Alt+2 - bookmarks and debug points
  • Alt+4 - console
  • Alt+5 - debug
  • F11 - create
  • Ctrl-Shift+F8 - debug points
  • Shift-F11 bookmarks
  • shift+Esc - hide current window
  • switch to main window - shift+Esc or F4 or Alt+current window or double Alt+any
  • C-x k - close current tab

not emacs

  • Ctrl+/ - comment text
  • Ctrl+b - navigate to definition

61.5. ipython

  • Ctrl+e Ctrl+o - multiline code or if 1:
  • Ctrl+r - search in history

61.6. geany

no autocompletion

61.7. BlueFish

Style - preferences->Editor settings->Fonts&Colours->Use system wide color settings

  • S-C-c comment
  • C-space completion

to execute file:

  • preferences->external commands->
    • any name: xfce4-terminal -e 'bash -c "python %f; exec bash"'

cons

  • cannot execute

61.8. Eric

61.9. Google Colab

61.9.2. initial config

  • Runtime -> View resources -> Change runtime type - GPU
  • Editor -> Code diagnostics -> Syntax and type checking
  • Miscellaneous -> Power level - ?

61.9.3. keys (checked):

  • Ctrl-a/e Move cursor to the beginning/end of the line
  • Ctrl-Alt-n/p Move cursor to the beginning of the line
  • Ctrl-d/h Delete next/previous character in line
  • Ctrl-k Delete text from cursor to end of line
  • Ctrl-space auto completion
  • Ctrl+o new line and stay at current
  • Ctrl+j delete the end-of-line character and set cursor at the end (join lines)
  • Ctrl+m m/y convert (code to text)/(text to code)
  • Ctrl+z/y undo/redo action

Docstring:

  • Ctrl + mouse over variable
  • Ctrl + space + mouse click

keys advanced (checked)

  • Ctrl+s save notebook
  • Ctrl+m activate the shortcuts
  • Ctrl+m h get Keyboard preferences
  • Tab Toggle code docstring help
  • Shift+Tab Unindent current line
  • Ctrl+m n/p next/previous cell (like arrows)
  • Ctrl+] Collapse
  • Ctrl+' toggle collapse
  • Ctrl+Shift+Enter Run
  • Ctrl+Shift+S select focused cell
  • Ctrl+m o show hide output
  • Ctrl+m a/b add cell above/below
  • ctrl+m+d Delete cell
  • Ctrl+shift+alt+p command palette

61.9.4. keys in Internet (emacs IPython console)

Ctrl-C and Ctrl-V for copying and pasting work in a wide variety of programs and systems

  • Ctrl-a Move cursor to the beginning of the line
  • Ctrl-e Move cursor to the end of the line
  • Ctrl-b or the left arrow key Move cursor back one character
  • Ctrl-f or the right arrow key Move cursor forward one character
  • Backspace key Delete previous character in line
  • Ctrl-d Delete next character in line
  • Ctrl-k Cut text from cursor to end of line
  • Ctrl-u Cut text from beginning of line to cursor
  • Ctrl-y Yank (i.e. paste) text that was previously cut
  • Ctrl-t Transpose (i.e., switch) previous two characters
  • Ctrl-p (or the up arrow key) Access previous command in history
  • Ctrl-n (or the down arrow key) Access next command in history
  • Ctrl-r Reverse-search through command history

?

  • Ctrl-l Clear terminal screen
  • Ctrl-c Interrupt current Python command
  • Ctrl-d Exit IPython session

61.9.5. Google Colab Magics

a set of system commands that form a mini command language

  • line magics start with %, while the cell magics start with %%
  • %lsmagic - full list of available magics
  • %ldir
  • %%html

more https://colab.research.google.com/notebooks/intro.ipynb

61.9.6. install libraries and system commands

61.9.7. execute code from google drive

# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

!python3 "/content/drive/My Drive/Colab Notebooks/hello.py"

61.9.8. shell

from IPython.display import JSON
from google.colab import output
from subprocess import getoutput
import os

def shell(command):
  if command.startswith('cd'):
      path = command.strip().split(maxsplit=1)[1]
      os.chdir(path)
      return JSON([''])
  return JSON([getoutput(command)])
output.register_callback('shell', shell)
#@title Colab Shell
%%html
<div id=term_demo></div>
<script src="https://code.jquery.com/jquery-latest.js"></script>
<script src="https://cdn.jsdelivr.net/npm/jquery.terminal/js/jquery.terminal.min.js"></script>
<link href="https://cdn.jsdelivr.net/npm/jquery.terminal/css/jquery.terminal.min.css" rel="stylesheet"/>
<script>
  $('#term_demo').terminal(async function(command) {
      if (command !== '') {
          try {
              let res = await google.colab.kernel.invokeFunction('shell', [command])
              let out = res.data['application/json'][0]
              this.echo(new String(out))
          } catch(e) {
              this.error(new String(e));
          }
      } else {
          this.echo('');
      }
  }, {
      greetings: 'Welcome to Colab Shell',
      name: 'colab_demo',
      height: 250,
      prompt: 'colab > '
  });
</script>

61.9.9. gcloud

  • gcloud info - current environment

import torch
print(torch.cuda.get_device_name())

LD_LIBRARY_PATH=/usr/lib64-nvidia watch -n 1 nvidia-smi

!gcloud auth login # Authorize gcloud to access the Cloud Platform with Google user credentials.

connect Google Colab to Google Cloud.

!gcloud compute ssh --zone us-central1-a 'instance-name' -- -L 8888:localhost:8888

61.9.10. gcloud ssh (require billing)

bad: !gcloud config set account account@gmail
!gcloud auth login
!gcloud projects create vfdsgq2345 --enable-cloud-apis --name vfdsgq2345 --set-as-default

Create in progress for [https://cloudresourcemanager.googleapis.com/v1/projects/vfdsgq2346]. Enabling service [cloudapis.googleapis.com] on project [vfdsgq2346]… Operation "operations/acat.p2-872588642643-8ef11211-5181-47e3-bcd2-383690de7d91" finished successfully. Updated property [core/project] to [vfdsgq2346].

!gcloud config set project 1
!gcloud compute ssh

gcloud compute ssh example-instance --zone=us-central1-a -- -vvv -L 80:%INSTANCE%:80

!gcloud compute ssh 10.2.3.4:22 --zone=us-central1-a -- -vvv -L 80:localhost:80

61.9.12. upload and download files

from google.colab import files
files.upload/download()

61.9.13. connect ssh (restricted)

https://medium.com/@ayaka_45434/connect-to-google-colab-using-ssh-bb342e0d0fd2

at relay server:

  • $ ssh-keygen -t ed25519 -a 256
  • $ cat .ssh/id_ed25519.pub

at colab:

%%sh
mkdir -p ~/.ssh
echo '<SSH public key of PC>' >> ~/.ssh/authorized_keys
apt update > /dev/null
yes | unminimize > /dev/null
apt install -qq -o=Dpkg::Use-Pty=0 openssh-server pwgen net-tools psmisc pciutils htop neofetch zsh nano byobu > /dev/null
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa > /dev/null
echo ListenAddress 127.0.0.1 >> /etc/ssh/sshd_config
mkdir -p /var/run/sshd
/usr/sbin/sshd

61.9.14. connect ssh (unrestricted)

at colab:

  1. !git clone https://github.com/WassimBenzarti/colab-ssh ; mv colab-ssh cs ; cd cs ; rm -r .git

!git clone --depth=1 https://github.com/openssh/openssh-portable ; mv openssh-portable cs ; cd cs ; rm -r .git ; autoreconf && ./configure && make && make install ; mv /usr/local/sbin/sshd /usr/local/sbin/aav

%%shell
a=$(cat <<EOF
AcceptEnv LANG LC_ALL LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY LC_NUMERIC LC_TIME LANGUAGE LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_NAME LC_PAPER LC_TELEPHONE
AcceptEnv COLORTERM

Port 9090
ListenAddress 127.0.0.1
AllowUsers u

PermitRootLogin no
PubkeyAuthentication yes
PasswordAuthentication no
PermitEmptyPasswords no
KbdInteractiveAuthentication no
EOF
)
echo "$a" > aav.conf ; useradd -m sshd ; ls

!mkdir root.ssh ; chmod 0700 root.ssh ; mv cs/ssh aavc ; ./cs/ssh-keygen -b 4096 -t rsa -f root.ssh/mykey_rsa -q -N "" ; cat root.ssh/mykey_rsa.pub > root.ssh/authorized_keys

!exec /usr/local/sbin/aav -f aav.conf

!cat root.ssh/mykey_rsa.pub > root.ssh/authorized_keys

!./aavc -vvv -p 9090 localhost

61.9.15. Restrictions

disallowed from Colab runtimes:

  • file hosting, media serving, or other web service offerings not related to interactive compute with Colab
  • downloading torrents or engaging in peer-to-peer file-sharing
  • using a remote desktop or SSH
  • connecting to remote proxies
  • mining cryptocurrency
  • running denial-of-service attacks
  • password cracking
  • using multiple accounts to work around access or resource usage restrictions
  • creating deepfakes

61.9.16. cons

  • GPU/TPU usage is limited
  • Not the most powerful GPU/TPU setups available
  • Not the best de-bugging environment
  • It is hard to work with big data
  • Have to re-install extra dependencies every new runtime
  • Google drive: limited to 15 GB of free space with a Gmail id.
  • you’ll have to (re)install any additional libraries you want to use every time you (re)connect to a Google Colab notebook.

Alternatives:

  • Kaggle
  • Azure Notebooks
  • Amazon SageMaker
  • Paperspace Gradient
  • FloydHub

62. Jupyter Notebook

https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Importing Notebooks.html .ipynb

each cell should ideally be idempotent

62.1. jupyter [ˈʤuːpɪtə] - with an emphasis on interactive computation

  • https://jupyter.org/
  • Idea - not to draw, but to select working rules
  • many languages https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
  • Project Jupyter - nonprofit organization, interactive computing across dozens of programming languages. Free for all to use and released under the liberal terms of the modified BSD license
    • Jupyter Notebook - web-based - .ipynb - Jupyter Notebook is MathJax-aware (subset of TeX and LaTeX)
    • Jupyter Hub
    • Jupyter Lab - next-generation version of Jupyter Notebook; interface for all products in the Jupyter ecosystem; editing of images, CSV, JSON, Markdown, PDF, Vega, Vega-Lite
    • Jupyter Console
    • Qt Console

kernels: jupyter kernelspec list

%run -n main.py  - run the file without setting __name__ to '__main__' (like an import)

62.2. install

pip3 install nbconvert --user

launch:

  • cd to folder with .ipynb
  • jupyter-notebook # it will open the browser

62.3. convert to html

jupyter nbconvert /home/u2/tmp/Lecture_10_decision_trees.ipynb

62.4. Widgets

62.4.1. install

run

  • pip install ipywidgets --user
  • jupyter nbextension enable --py widgetsnbextension

62.4.2. usage

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

date_w = widgets.DatePicker(
    description='Pick a Date',
    disabled=False
)

def f(x):
    return x

interact(f, x=date_w) # x - name of f(x) parameter and *type of widget*
interact(f, x=10); # int slider (abbrev)
interact(f, x=True); # bool flag (abbrev)

interact(h, p=5, q=fixed(20)); # q parameter is fixed

62.4.3. widget abbreviation

  • Checkbox - True or False
  • Text - 'Hi there'
  • IntSlider - value or (min,max) or (min,max,step) if integers are passed
  • FloatSlider - value or (min,max) or (min,max,step) if floats are passed
  • Dropdown - ['orange','apple'] or [('one', 1), ('two', 2)]

62.4.4. widget return type

  • widgets.DatePicker - datetime.date

62.4.5. Styling

https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Styling.html

Description

  • style = {'description_width': 'initial'}
  • IntSlider(description='A too long description', style=style)

62.5. Hotkeys:

  • Enter - in cell
  • Escape - exit cell
  • h - hotkeys
  • Ctrl+Enter/Shift+Enter - run
  • Tab - code completion
  • arrow up/down - above/below cell

62.6. emacs (sucks)

org-mode may evaluate code blocks using a Jupyter kernel https://github.com/gregsexton/ob-ipython

jupyter_console, jupyter_client

62.8. TODO lab


63. USE CASES

measure time 28.3

63.1. NET

63.1.1. REST request

import urllib.request
import json


API_KEY = 'f670813c14f672c1e197101fd767cbe675933d86'
headers = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5',
           'Content-Type': 'application/json',
           'Accept': 'application/json',
           'Authorization': 'Token ' + API_KEY
}

data = '{ "query": "Виктор Иван", "count": 3 }'
req = urllib.request.Request(url='https://suggestions.dadata.ru/suggestions/api/4_1/rs/suggest/fio',
                             headers=headers, data=data.encode())
with urllib.request.urlopen(req) as f:
    r = f.read().decode('utf-8')
    j = json.loads(r)
    print(j['suggestions'][0]["unrestricted_value"])
    print(j['suggestions'][0]["gender"])
    j2 = json.dumps(j, ensure_ascii=False, indent=4)
    print(j2)

63.1.2. email IMAP

import configparser as cp
import cx_Oracle
import datetime
import email
import imaplib
import logging
import os
import re
import requests
import shutil
import smtplib
import zipfile
import sys

from email.header import decode_header
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.utils import formatdate
from os.path import basename
from requests.auth import HTTPBasicAuth
from sys import exit

def decode_header_fix(subject_list: list) -> str:
    """ decode to string any header after decode_header"""
    sub_list = []
    for subject in subject_list:
        if subject and subject[1]:
            subject = (subject[0].decode(subject[1]))
        elif type(subject[0]) == bytes:
            subject = subject[0].decode('utf-8')
        else:
            subject = subject[0]
        sub_list.append(subject)
    return ''.join(sub_list)


def send_mail(username, password, send_from, send_to, subject,
              text, files=None, server="mx1.rnb.com"):
    assert isinstance(send_to, list)

    msg = MIMEMultipart()
    msg['From'] = send_from
    msg['To'] = ', '.join(send_to)  # comma-space between recipients
    msg['Date'] = formatdate(localtime=True)
    msg['Subject'] = subject

    msg.attach(MIMEText(text))

    for f in files or []:
        with open(f, "rb") as fil:
            part = MIMEApplication(
                fil.read(),
                Name=basename(f)
            )
        # After the file is closed
        part['Content-Disposition'] = 'attachment; filename="%s"' % basename(f)
        msg.attach(part)

    smtp = smtplib.SMTP(server)
    smtp.login(username, password)
    log.debug(u'Отправляю письмо на %s' % send_to)
    smtp.sendmail(send_from, send_to, msg.as_string())
    smtp.close()


def save_attachment(conn: imaplib.IMAP4, emailid: str, outputdir: str, file_pattern: str):
    """ https://docs.python.org/3/library/imaplib.html

    :param conn: connection
    :param emailid:
    :param outputdir:
    :param file_pattern: regex pattern for file name of attachment
    :return:
    """
    try:
        ret, data = conn.fetch(emailid, "(BODY[])")
    except imaplib.IMAP4.error:
        # no new emails to read
        conn.close_connection()
        exit()
    mail = email.message_from_bytes(data[0][1])
    # print('From:' + mail['From'])
    # print('To:' + mail['To'])
    # print('Date:' + mail['Date'])
    # subject_list = decode_header(mail['Subject'])
    # subject = decode_header_fix(subject_list) # must be: Updating client ICODE RNB_378026
    # print('Subject:' + subject)
    # print('Content:' + str(mail.get_payload()[0]))

    # process_out_reestr(mail)

    if mail.get_content_maintype() != 'multipart':
        return
    for part in mail.walk():
        if part.get_content_maintype() != 'multipart' and part.get('Content-Disposition') is not None:
            filename_list = decode_header(part.get_filename())  # (encoded_string, charset)
            filename = decode_header_fix(filename_list)
            if not re.search(file_pattern, filename):
                continue
            # write attachment
            print("OKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK")
            with open('{}/{}'.format(outputdir, filename), 'wb') as f:
                f.write(part.get_payload(decode=True))


def download_email_attachments(server: str, user: str, password: str, outputdir: str,
                               subject_contains: str, file_pattern: str, days_since=0) \
        -> bool or None:
    """

    :param server:
    :param user:
    :param password:
    :param outputdir:
    :param subject_contains:
    :param file_pattern:
    :param days_since:
    :return:
    """
    date = datetime.datetime.now() - datetime.timedelta(days=days_since)
    # https://docs.python.org/3/library/imaplib.html
    # https://tools.ietf.org/html/rfc3501#page-49
    # SUBJECT <string>
    #          Messages that contain the specified string in the envelope
    #          structure's SUBJECT field
    criteria = '(SENTSINCE "{}" SUBJECT "{}")'.format(date.strftime('%d-%b-%Y'),
                                                      subject_contains)

    try:
        m = imaplib.IMAP4_SSL(server)
        m.login(user, password)
        m.select()
        resp, items = m.search(None, criteria)
        if not items[0]:
            log.debug(u'Нет писем с реестрами в папке ВХОДЯЩИЕ')
            return False
        items = items[0].split()
        for emailid in items:
            save_attachment(m, emailid, outputdir, file_pattern)
            # TODO: change
            # m.store(emailid, '+FLAGS', '\\Seen')
            # m.copy(emailid, 'processed')
            # m.store(emailid, '+FLAGS', '\\Deleted')
        m.close()
        m.logout()
    except imaplib.IMAP4_SSL.error as e:
        print("LOGIN FAILED!!! ", e)
        sys.exit(1)
    return True


if __name__ == '__main__':
    import tempfile
    c = config_load('autocred.conf')
    log = init_logger(logging.INFO, c['storage']['log_path'])  # required by all methods
    #
    # with tempfile.TemporaryDirectory() as tmp:
    #     print(tmp)
    #     res = download_email_attachments(server=c['imap']['host'],
    #                                      user=c['imap']['login'],
    #                                      password=c['imap']['password'],
    #                                      outputdir=tmp, subject_contains='Updating client ICODE RNB_',
    #                                      file_pattern=r'^client_identity_RNB_\d+\.zip\.enc$', days_since=1)
    #     extract_zip_files(tmp)
    #     for x in os.listdir(tmp):
    #         print(x)

    tmp = '/home/u2/Desktop/tmp/tmp2/'
    # res = download_email_attachments(server=c['imap_bistr']['host'],
    #                                  user=c['imap_bistr']['login'],
    #                                  password=c['imap_bistr']['password'],
    #                                  outputdir=tmp,
    #                                  subject_contains='Updating client ICODE',  # 'Updating client ICODE RNB_378026'
    #                                  file_pattern=r'^client_identity_RNB_\d+\.zip\.enc$', days_since=3)

    for filename in os.listdir(tmp):
        print(filename)
        decrypt_file(uri=c['api']['dec_uri'],
                     cert_thumbprint=c['api']['dec_cert_thumbprint'],
                     user=c['api']['user'],
                     passw=c['api']['pass'],
                     filename=os.path.join(tmp, filename))
    for x in os.listdir(tmp):
        print(x)

63.1.3. email DKIM

('DKIM-Signature', 'v=1; a=rsa-sha256; q=dns/txt; c=simple/simple; d=bystrobank.ru\n\t; s=dkim;
h=Message-Id:Content-Type:MIME-Version:From:Date:Subject:To:Sender:\n\tReply-To:Cc:Content-Transfer-Encoding:Content-ID:Content-Description:\n\tResent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:\n\tIn-Reply-To:References:List-Id:List-Help:List-Unsubscribe:List-Subscribe:\n\tList-Post:List-Owner:List-Archive;\n\tbh=dDimDD8KIdEx1QkqygEiFeQfyTIgIztxgQu6BtkzQ5o=;
b=hZGPWUFnQ2gGNV4UJ7MyaPJYFL\n\tbB9Csmpg/ukcwQuWBI1NtvILUoviMff4ACkNnhPgD7OV4aGtR5UBOy81tdvY5cQnBFv9Yku9yAf8R\n\t1BV83crKYnhU4GRtw7wD4W64zpZRhX3KZxG8SWissmh+vNEMBlmYXN9FsuLyVKaBbks0DYnR3HA9Q\n\tFV4d8CMC8wLrdmBi/MV0x75Q9GhDhGMc8MPNAleuWabHOT8Bmf7FLHQERHBRYm78i4wDWEFFNv5Ox\n\tuqMEm5iJQeYRnoHkrm5KEEP4DYohb8GgJkfIIZs4dO2oMjJif/2A1JLnmq64KPmoAE3s8lO2Bo2Zq\n\t68tnSdFA==;')
pip3 install dkimpy --user
import dkim
# verify email
    try:
        res = dkim.verify(data[0][1])
    except:
        log.error(u'Invalid signature')
        return
    if not res:
        log.error(u'Invalid signature')
        return
    print('[' + os.path.basename(__file__) + '] isDkimValid = ' + str(res))

    mail = email.message_from_bytes(data[0][1])
    # verify sender domain
    dkim_sig = decode_header(mail['DKIM-Signature'])
    dkim_sig = decode_header_fix(dkim_sig)
    if not re.search(r" d=bystrobank\.ru", dkim_sig):
        return

63.1.4. urllib SOCKS

pip install requests[socks]

import urllib
import socket
import socks
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1",port=8888)
save = socket.socket
socket.socket = socks.socksocket # replace socket with socks
req = urllib.request.Request(url='http://httpbin.org/ip')
urllib.request.urlopen(req).read() # default request

63.2. LISTS

63.2.1. all has one value

list.count('value') == len(list)
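A set-based alternative that does not hard-code the value (works for any hashable elements):

```python
lst = ['a', 'a', 'a']
assert lst.count('a') == len(lst)  # the variant above
assert len(set(lst)) == 1          # all elements equal (non-empty list)
```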

63.2.2. 2D list to 1D dict or list

[j for sub in [[1,2,3],[1,2],[1,4,5,6,7]] for j in sub]
{j for sub in [[1,2,3],[1,2],[1,4,5,6,7]] for j in sub}

63.2.3. list to string

' '.join(a)  # the generator form ' '.join(w for w in a) is equivalent but redundant

63.2.4. replace one with two

l[pos:pos+1] = ('a', 'b')
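A quick demonstration of the slice assignment above:

```python
l = [1, 2, 3]
pos = 1
l[pos:pos + 1] = ('a', 'b')  # the element at pos is replaced by two elements
assert l == [1, 'a', 'b', 3]
```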

63.2.5. remove elements

filter

self.contours = list(filter(lambda a: a is not None, self.contours))

new list

a = [item for item in a if ...]

iterate over a copy (deleting by index while iterating shifts positions and raises IndexError)

for x in lis[:]:
    if cond(x):  # cond - your predicate
        lis.remove(x)

63.2.6. average

[np.average((x[0], x[1])) for x in zip([1,2,3],[1,2,3])]

63.2.7. [1, -2, 3, -4, 5]

>>> [(x % 2 -0.5)*2*x for x in range(1,10)]
[1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0, 9.0]
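The same alternating sign can be written with a conditional expression, which reads a bit more directly:

```python
out = [x if x % 2 else -x for x in range(1, 10)]
assert out == [1, -2, 3, -4, 5, -6, 7, -8, 9]
```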

63.2.8. zip of arrays with different lengths

import itertools
z= itertools.zip_longest(arr1,arr2,arr3)
flat_list=[]
for x in z:
    subflat=[]
    for subl in x:
        if subl != None:
            subflat.append(subl[0])
            subflat.append(subl[1])
            subflat.append(subl[1])
        else:
            subflat.append('')
            subflat.append('')
    flat_list.append(subflat)
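The core of the snippet above is itertools.zip_longest; a minimal sketch with a fill value instead of the default None:

```python
from itertools import zip_longest

a = [1, 2, 3]
b = ['x', 'y']
pairs = list(zip_longest(a, b, fillvalue=''))  # shorter list padded with ''
assert pairs == [(1, 'x'), (2, 'y'), (3, '')]
```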


63.2.9. Shuffle two lists

z = list(zip(self.x, self.y))
random.shuffle(z)  # shuffles in place and returns None
self.x, self.y = zip(*z)

63.2.10. list of dictionaries

  1. search and encode
    def one_h_str_col(dicts: list, column: str):
        c = set([x[column] for x in dicts])  # unique
        c = list(c)  # .index
        nb_classes = len(c)
        targets = np.arange(nb_classes)
        one_hot_targets = np.eye(nb_classes)[targets]
        for i, x in enumerate(dicts):
            x[column] = list(one_hot_targets[c.index(x[column])])
        return dicts
    
    
    def one_h_date_col(dicts: list, column: str):
        for i, x in enumerate(dicts):
            d: date = x[column]
            x[column] = d.year
        return dicts
    
    
    def one_h(dicts: list):
        for col in dicts[0].keys():
            lst = set([x[col] for x in dicts])
            if all(isinstance(x, (str, bytes)) for x in lst):
                dicts = one_h_str_col(dicts, col)
            if all(isinstance(x, date) for x in lst):
                dicts = one_h_date_col(dicts, col)
        return dicts
    
    dicts = [
    { "name": "Mark", "age": 5 },
    { "name": "Tom", "age": 10 },
    { "name": "Pam", "age": 7 },
    ]
    
    c = set([x['name'] for x in dicts]) # unique
    c = list(c)  # .index
    
    for i, x in enumerate(dicts):
      x['name'] = c.index(x['name'])
    
    
  2. separate labels from matrix
    matrix = [list(x.values()) for x in dicts]
    labels = dicts[0].keys()
    

63.2.11. closest in list

alph = [1, 2, 5, 7]
source = [1, 2, 3, 6]  # 3 and 6 get replaced
target = source[:]
for i, s in enumerate(source):
  if s not in alph:
    distance = [(abs(x - s), x) for x in alph]
    res = min(distance, key=lambda x: x[0])
    target[i] = res[1]

63.2.12. TIME SEQUENCE

smooth

mean_ver1 = pandas.Series(mean_ver1).rolling(window=5).mean()

63.2.13. split list in chunks

our_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
chunk_size = 3
chunked_list = [our_list[i:i+chunk_size] for i in range(0, len(our_list), chunk_size)]
print(chunked_list)

63.3. FILES

  • os.path.join('/home','user') - /home/user
  • os.listdir('/home/user') -> list of file names - files and directories
  • os.path.isdir/isfile() -> True/False
  • os.walk() - subdirectories = [(folder_path, list_folders, list_files), … ]
  • extension = os.path.splitext(filename)[1][1:]

Extract from subfolders: find . -mindepth 2 -type f -print -exec mv {} . \;
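A self-contained os.walk sketch matching the tuple shape listed above (the temp-dir layout is illustrative):

```python
import os
import tempfile

# build a throwaway directory tree to walk
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
open(os.path.join(root, 'sub', 'a.txt'), 'w').close()

paths = []
for folder_path, list_folders, list_files in os.walk(root):
    for name in list_files:
        paths.append(os.path.join(folder_path, name))

assert len(paths) == 1 and paths[0].endswith('a.txt')
```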

63.3.1. Read JSON

import codecs
import json

fileObj = codecs.open("provodki_1000.json", encoding='utf-8', mode='r')
text = fileObj.read()
fileObj.close()
data = json.loads(text)

# or
import json
with open('test_data.txt', 'r') as myfile:
    data=myfile.read()
obj = json.loads(data)


63.3.2. CSV

  1. array to CSV file for Excel
    wtr = csv.writer(open ('out.csv', 'w'), delimiter=';', lineterminator='\n')
    for x in arr :
        wtr.writerow(x)
    
  2. read CSV and write
    import csv
    
    p = '/home/u2/Downloads/BANE_191211_191223.csv'
    
    with open(p, 'r') as f:
        reader = csv.reader(f, delimiter=';', quoting=csv.QUOTE_NONE)
        for row in reader:
            print(row)
    

63.3.3. read file

Whole:

import codecs
fileObj =codecs.open("provodki_1000.json", encoding='utf-8', mode='r')
text = fileObj.read()
fileObj.close()

Line by line:

with open(fname) as f:    content = f.readline()

go to the beginning of the file

file.seek(0)

read whole text file:

with open(fname) as f:    content = f.readlines()
with open(fname) as f: temp = f.read().splitlines()

63.3.4. Export to Excel

https://docs.python.org/3.6/library/csv.html

import csv
wtr = csv.writer(open('out.csv', 'w'), delimiter=';', lineterminator='\n')
wtr.writerows(flat_list)

63.3.5. NameError: name 'A' is not defined

try:
    file.close()
except NameError:
    pass  # 'file' was never defined

63.3.6. rename files (list directory)

import os
from shutil import copyfile

sd = '/mnt/hit4/hit4user/kaggle/abstraction-and-reasoning-challenge/training/'

td = '/mnt/hit4/hit4user/kaggle/abc/training/'
dirFiles = os.listdir(sd)
dirFiles.sort(key=lambda f: int(f[:-5], base=16))
for i, x in enumerate(dirFiles):
    src = os.path.join(sd,x)
    dst = os.path.join(td,str(i))
    copyfile(src, dst)

63.3.7. current directory

import sys, os
os.path.abspath(sys.argv[0])
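Note that the script's own path differs from the current working directory; a quick sketch:

```python
import os
import sys

print(os.path.abspath(sys.argv[0]))  # path of the running script
print(os.getcwd())                   # current working directory (may differ)
assert os.path.isdir(os.getcwd())
```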

63.4. STRINGS

63.4.1. String comparison

https://stackabuse.com/comparing-strings-using-python/

  • == compares two variables based on their actual value
  • is operator compares two variables based on the object id

Rule: use == to compare values; use is only to compare object identity (e.g. x is None).
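A quick illustration of the value-vs-identity distinction:

```python
a = [1, 2]
b = [1, 2]
assert a == b        # equal values
assert a is not b    # but distinct objects
assert b is not None # `is` is the idiomatic way to test for None
```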

  • a.lower() == b.lower()
  1. difflib.SequenceMatcher - gestalt pattern matching
    from difflib import SequenceMatcher
    m = SequenceMatcher(None, "NEW YORK METS", "NEW YORK MEATS")
    m.ratio() ⇒ 0.962962962963
    # disadvantage (fuzz is from the fuzzywuzzy / thefuzz package):
    fuzz.ratio("YANKEES", "NEW YORK YANKEES") ⇒ 60 # same team
    fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 75 # different teams
    
    # fix: best partial:
    from difflib import SequenceMatcher
    
    def a(s1,s2):
        if len(s1) <= len(s2):
            shorter = s1
            longer = s2
        else:
            shorter = s2
            longer = s1
    
        m = SequenceMatcher(None, shorter, longer)
        blocks = m.get_matching_blocks()
        scores = []
        for block in blocks:
            long_start = block[1] - block[0] if (block[1] - block[0]) > 0 else 0
            long_end = long_start + len(shorter)
            long_substr = longer[long_start:long_end]
    
            m2 = SequenceMatcher(None, shorter, long_substr)
            r = m2.ratio()
            if r > .995:
                return 100
            else:
                scores.append(r)
    
        return int(round(100 * max(scores)))
    
    s1 = "MEATS"
    s2 = "NEW YORK MEATS"
    print(a(s1, s2))          # 100
    print(a("asd", "123asd")) # 100
    print(a("asd", "asd123")) # 100
  2. https://en.wikipedia.org/wiki/Levenshtein_distance
    memo = {}  # cache for the recursive calls

    def levenshtein(s: str, t: str) -> int:
        """Edit distance between s and t (memoized recursion).

        :param s: first string
        :param t: second string
        :return: 0 .. max(len(s), len(t))
        """
        if s == "":
            return len(t)
        if t == "":
            return len(s)
        cost = 0 if s[-1] == t[-1] else 1
    
        i1 = (s[:-1], t)
        if not i1 in memo:
            memo[i1] = levenshtein(*i1)
        i2 = (s, t[:-1])
        if not i2 in memo:
            memo[i2] = levenshtein(*i2)
        i3 = (s[:-1], t[:-1])
        if not i3 in memo:
            memo[i3] = levenshtein(*i3)
        res = min([memo[i1] + 1, memo[i2] + 1, memo[i3] + cost])
    
        return res
    
  3. hamming distance
    import hashlib
    
    def hamming_distance(chaine1, chaine2):
        return sum(c1 != c2 for c1, c2 in zip(chaine1, chaine2))
    
    def hamming_distance2(chaine1, chaine2):
        return len(list(filter(lambda x : ord(x[0])^ord(x[1]), zip(chaine1, chaine2))))
    print(hamming_distance("chaine1", "chaine2"))
    
    print(hamming_distance2("chaine1", "chaine2"))
    
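The recursive Levenshtein above can drop the manual memo dict by using functools.lru_cache from the stdlib; a sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # caches (s, t) pairs, replacing the manual memo dict
def lev(s: str, t: str) -> int:
    if not s:
        return len(t)
    if not t:
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    return min(lev(s[:-1], t) + 1,
               lev(s, t[:-1]) + 1,
               lev(s[:-1], t[:-1]) + cost)

print(lev("kitten", "sitting"))  # 3
```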

63.4.2. Remove whitespaces

line = " ".join(line.split()) # resplit

63.4.3. Unicode

  • '\u2116'.encode("unicode_escape")
    • b'\\u2116'
  • print('№'.encode("unicode_escape"))
    • b'\\u2116'
  • print('\u2116'.encode("utf-8")) # sometimes do wrong
    • b'\xe2\x84\x96'
  • print(b'\xe2\x84\x96'.decode('utf-8'))
  • print('\u2116'.encode("utf-8").decode('utf-8'))
  1. terms
    • code points, first two characters are always "U+", hexadecimal. At least 4 hexadecimal digits are shown, prepended with leading zeros as needed. ex: U+00F7
    • BOM - magic number at the start of a text
      • UTF-8 byte sequence EF BB BF, permits the BOM in UTF-8, but does not require or recommend its use.
      • Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII.
      • In UTF-16, a BOM (U+FEFF), byte sequence FE FF
    • UTF-8 Encoding or Hex UTF-8 - hex representation of encoded 1-4 bytes.
  2. Encoding formats: UTF-8, UTF-16, GB18030, UTF-32

    utf-8

    • ASCII-compatible
    • 1-4 bytes for each code point

    UTF-16

    • ASCII-compatible

    GB18030

  3. utf-8
    | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
    | U+0000           | U+007F          | 0xxxxxxx |          |          |          |
    | U+0080           | U+07FF          | 110xxxxx | 10xxxxxx |          |          |
    | U+0800           | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
    | U+10000          | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
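The byte counts in the table can be checked directly:

```python
# number of UTF-8 bytes per code-point range
for ch in ('A', '\u00f7', '\u2116', '\U0001F600'):
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} bytes")
# U+0041 -> 1, U+00F7 -> 2, U+2116 -> 3, U+1F600 -> 4
```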

63.4.4. To find all the repeating substring in a given string

https://stackoverflow.com/questions/41077268/python-find-repeated-substring-in-string

You can do it by repeating the substring a certain number of times and testing if it is equal to the original string.
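For the exact case (a string built purely from repeats) the classic `(s + s).find(s, 1)` trick finds the smallest repeating unit; the Levenshtein approach that follows is for noisy strings:

```python
def smallest_repeating_unit(s: str):
    # index where s re-occurs inside s+s; an index of len(s) means no smaller period
    i = (s + s).find(s, 1)
    return s[:i] if i != len(s) else None

print(smallest_repeating_unit("abcabcabc"))  # abc
print(smallest_repeating_unit("abcd"))       # None
```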

memo = {}  # cache for the recursive calls

def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t (memoized recursion).

    :param s: first string
    :param t: second string
    :return: 0 .. max(len(s), len(t))
    """
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1

    i1 = (s[:-1], t)
    if not i1 in memo:
        memo[i1] = levenshtein(*i1)
    i2 = (s, t[:-1])
    if not i2 in memo:
        memo[i2] = levenshtein(*i2)
    i3 = (s[:-1], t[:-1])
    if not i3 in memo:
        memo[i3] = levenshtein(*i3)
    res = min([memo[i1] + 1, memo[i2] + 1, memo[i3] + cost])

    return res


c = '03105591400310559140031055914003105591400310559140031055914003105591400310559140'
c = '0310559140031055914031055914003105591400310591400310559140031055910030559140'
a=[]
for j in range(10):
    for i in range(7):
        if (i*10+10+j) <= len(c):
            a.append(c[i*10+j:i*10+10+j])
v = {x: a.count(x) for x in a if a.count(x) >2}
#for k in v.keys():
#    print(k, levenshtein(k*8, c))
re = {k: levenshtein(k*8,c) for k in v.keys()}
print(sorted(re, key=re.__getitem__)[0]) # asc
0310559140 4
3105591400 6
1055914003 8
0559140031 10
5591400310 12
5914003105 14
9140031055 12
1400310559 10
4003105591 8
0031055914 6
'3105591400310559140031055914003105591400310559140031055914003105591400310559140'
3105591400 1
1055914003 3
0559140031 5
5591400310 7
5914003105 9
9140031055 9
1400310559 7
4003105591 5
0031055914 3
0310559140 1 - THIS

63.4.5. first substring

  • str.find
  • by regex:
m = re.search("[0-9]+", d)
if m:
    num = m.group(0)  # same as d[m.start():m.end()]

63.5. DICT

add

d1.update(d2) # d1 = d1+d2

find max value

import operator
max(d1.items(), key=operator.itemgetter(1))[0]
# or simply:
max(d1, key=d1.get)

for

  • for key in dict:
  • for key, value in dict.items():

sorted dict

c = sorted(abb_sel_diff_middle.items(), key=lambda kv: kv[1], reverse=True)  # descending by value
numbers = {'first': 2, 'second': 1, 'third': 3, 'Fourth': 4}
sorted(numbers, key=numbers.__getitem__)
>>['second', 'first', 'third', 'Fourth']

merge two dicts

z = {**x, **y}
# or, Python 3.9+:
z = x | y

63.5.1. del

loop with clone

for k, v in list(d.items()):
    if v is bad:
        del d[k]
# or
d = {k: v for k, v in d.items() if v is not bad}

filter

self.contours = list(filter(lambda a: a is not None, self.contours))

63.6. argparse: command line arguments

63.6.1. terms

  • positional arguments - arguments without options (main.py input_file.txt)
  • options that accept values (–file a.txt)
  • on/off flags - options without any values (–overwrite)

63.6.2. usage

import sys
>>> print(sys.argv)

or

import argparse



def main(args):
    args.batch_size

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", help="data directory", default='./data')
    parser.add_argument("--default_settings", help="use default settings", type=bool, default=True)
    parser.add_argument("--combine_train_val", help="combine the training and validation sets for testing", type=bool,
                        default=False)
    main(parser.parse_args())

63.6.3. optional positional argument

parser.add_argument('bar', nargs='?', default='d')
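A self-contained check of how the optional positional behaves:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('bar', nargs='?', default='d')

print(parser.parse_args([]).bar)      # d  (argument omitted, default used)
print(parser.parse_args(['x']).bar)   # x
```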

63.7. way to terminate

sys.exit()

63.8. JSON

may be array or object

  • replace " with \"
  • replace \ with
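json.dumps performs that escaping automatically; a quick check:

```python
import json

value = 'say "hi" \\ end'
s = json.dumps({"k": value}, ensure_ascii=False)
print(s)  # " is escaped as \" and \ as \\
assert json.loads(s)["k"] == value  # round-trips unchanged
```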

63.9. NN EQUAL QUANTITY FROM SAMPLES

    lim = max(count.values())*2 # limit for all groups
    print(count.values())
    print('max', max(count.values()))

    for _, v in count.items(): # v - quantity
        c = 0 # current quantity
        for _ in range(v):  # i - v-1
            r = round(lim / v) #
            if c < lim + r:
                diff = 0
                if (c + r) > lim:
                    diff = c + r - lim
                #create: r - diff
                c += r - diff # may be removed
        print(c)

# Or in class -------------
import math

class Counter:
    def __init__(self, limit):  # , multiplyer):
        self.lim: int = limit  # int(max(amounts) * multiplyer)
        print("Counter limit:", self.lim)

    def new_count(self, one_amount):
        self.c: int = 0  # done counter
        self.r: int = math.ceil(self.lim / one_amount)  # multiplyer
        # x + y = one_amount
        # x* r + y = lim
        # y = one_amount - x  # without duplicates
        # x*r + one_amount - x = lim  # with duplicates
        # x*(r - 1) = lim - one_amount
        # x = (lim - one_amount) / (r - 1)
        if self.r == 1:
            self.wd = self.lim
        else:
            self.wd = (self.lim - one_amount) / (self.r - 1)    # take duplicates
            self.wd = self.wd * self.r

    def how_many_now(self) -> int:
        """ called one_amount times
        :return how many times repeat this sample to equal this one_amount to others
        """
        diff: int = 0
        if self.c > self.wd:
            r: int = 1
        else:
            r: int = self.r
        if (self.c + r) > self.lim:
            diff = self.c + r - self.lim  # last return

        self.c += r - diff  # update counter
        return int(r - diff)

counts = [20,30,10,7,100]
multiplyer = 2
counter = Counter(max(counts) * multiplyer)  # __init__ takes the limit, not the list
for v in counts:  # v - quantity
    counter.new_count(v)
    c = 0
    for _ in range(v):  # i - v-1 # one item
        c += counter.how_many_now()
    print(c)

63.10. most common element

def most_common(lst):
    return max(set(lst), key=lst.count)

mc = most_common([round(a, 1) for a in degrees if abs(a) != 0])
filtered_degrees = []
for a in degrees:
    if round(a, 1) == mc:
       filtered_degrees.append(a)
med_degree = float(np.median(filtered_degrees))


# max char
s3 = 'BEBBBB'
s3 = {x: s3.count(x) for x in s3}
mc = sorted(s3.values())[-1]
s3 = [key for key, value in s3.items() if value == mc][0]  # most common
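collections.Counter from the stdlib does the same in one call:

```python
from collections import Counter

print(Counter('BEBBBB').most_common(1)[0][0])         # B
print(Counter([1, 2, 2, 3, 2]).most_common(1)[0][0])  # 2
```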

63.11. print numbers

n=123123123412
print(f"{n:,}")

>>> 123,123,123,412

63.12. SCALE

# to range 0 1
def scaler_simple(data: np.array) -> np.array:
    """ in range (0,1)

    :param data: one dimensions
    :return:(0,1)
    """
    data_min = np.nanmin(data)
    data_max = np.nanmax(data)
    data = (data - data_min) / (data_max - data_min)
    return data

# to range -1 1
def scaler_simple_sym(data: np.array) -> np.array:
    """ in range (-1,1)

    :param data: one dimension
    :return: (-1,1)
    """
    data_min = np.nanmin(data)
    data_max = np.nanmax(data)
    data = 2 * (data - data_min) / (data_max - data_min) - 1
    return data

# (0,1) to (-1,1)
data = data * 2 - 1
# (-1,1) to (0,1)
data = (data + 1) / 2
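The two scalers above can be generalized to any target range; a sketch (NaNs ignored when computing min/max, as in the functions above):

```python
import numpy as np

def scale_to_range(data, lo=0.0, hi=1.0):
    """Min-max scale `data` into [lo, hi]."""
    dmin, dmax = np.nanmin(data), np.nanmax(data)
    return (data - dmin) / (dmax - dmin) * (hi - lo) + lo

print(scale_to_range(np.array([0.0, 5.0, 10.0]), -1, 1))  # [-1.  0.  1.]
```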

def my_scaler(data: np.array) -> np.array:
    """ data close to 0 will not add much value to the learning process

    :param data: two dimensions 0 - time, 1 - prices
    :return:
    """

    # data = scaler(data, axis=0)
    smoothing_window_size = data.shape[0] // 2  # for 10000 - 4
    dl = []
    for di in range(0, len(data), smoothing_window_size):
        window = data[di:di + smoothing_window_size]
        # print(window.shape)
        window = scaler(window, axis=1)
        # print(window[0], window[-1])
        dl.append(window)  # last line will be shorter

    return np.concatenate(dl)

63.13. smooth

def savitzky_golay(y, window_size, order, deriv=0, rate=1):

    import numpy as np
    from math import factorial

    try:
        window_size = abs(int(window_size))  # np.int is removed in modern numpy
        order = abs(int(order))
    except ValueError as msg:
        raise ValueError("window_size and order have to be of type int:", msg)
    if window_size % 2 != 1 or window_size < 1:
        raise TypeError("window_size size must be a positive odd number")
    if window_size < order + 2:
        raise TypeError("window_size is too small for the polynomials order")
    order_range = range(order+1)
    half_window = (window_size -1) // 2
    # precompute coefficients
    b = np.array([[k**i for i in order_range] for k in range(-half_window, half_window+1)])  # np.mat is deprecated
    m = np.linalg.pinv(b)[deriv] * rate**deriv * factorial(deriv)
    # pad the signal at the extremes with
    # values taken from the signal itself
    firstvals = y[0] - np.abs(y[1:half_window+1][::-1] - y[0])
    lastvals = y[-1] + np.abs(y[-half_window-1:-1][::-1] - y[-1])
    y = np.concatenate((firstvals, y, lastvals))
    return np.convolve(m[::-1], y, mode='valid')
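For comparison, a minimal moving-average smoother with plain numpy (scipy.signal.savgol_filter is the library equivalent of the function above):

```python
import numpy as np

def moving_average(y, n=3):
    # boxcar convolution; output has len(y) - n + 1 samples
    return np.convolve(y, np.ones(n) / n, mode='valid')

print(moving_average(np.array([1.0, 2.0, 3.0, 4.0]), n=2))  # [1.5 2.5 3.5]
```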

63.14. one-hot encoding

63.14.1. we have [1,3] [1,2,3,4], [3,4] -> numbers

import numpy as np
nb_classes = 4
targets = np.array([[2, 3, 4, 0]]).reshape(-1)
one_hot_targets = np.eye(nb_classes)[targets]
res:int = sum([x*(2**i) for i, x in enumerate(sum(one_hot_targets))]) # from binary to integer

63.14.2. column of strings

def one_h_str_col(col: np.array, name: str):
    """Label-encode a column of strings (integer codes, not true one-hot)."""
    c = list(set(col))  # unique values
    print(name, c)  # encoding map
    res_col = []
    for x in col:
        ind = c.index(x)
        res_col.append(ind)
    return np.array(res_col)

63.15. binary encoding

            s_ids = []
            for service_id, cost in cursor1.fetchall():  # service_id = None, 1,2,3,4
                service_id = 0 if service_id is None else int(service_id)
                s_ids.append(int(service_id))
            targets = np.array(s_ids).reshape(-1)
            s_id = 0
            if targets.size:  # a bare `if targets:` is ambiguous for numpy arrays
                one_hot_targets = np.eye(6)[targets]  # 6 classes (0..5)
                s_id: int = sum([x * (2 ** i) for i, x in enumerate(sum(one_hot_targets))])  # from binary to integer

63.16. map encoding

df['`condition`'] = df['`condition`'].map({'new': 0, 'uses': 1})

63.17. Accuracy

import numpy as np

Accuracy = (TP+TN)/(TP+TN+FP+FN):

print("%f" % (np.round(ypred2) == labels_test).mean())  # note: != would give the error rate

Precision = TP / (TP+FP)

63.18. garbage collect

import gc
del train, test; gc.collect()

63.19. Class: loop over member variables

for x in vars(instance): # attribute names as strings
   v = vars(instance)[x]  # attribute value

63.20. filter special characters

import re
def remove_special_characters(character):
    return character.isalnum() or character == ' '
text = 'datagy -- is. great!'
new_text = ''.join(filter(remove_special_characters, text))
print(new_text)

63.21. measure time

import time
start_time = time.time()
main()
print("--- %s seconds ---" % (time.time() - start_time))
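time.perf_counter() is the preferred high-resolution clock for measuring intervals (time.time() can jump with system clock changes):

```python
import time

start = time.perf_counter()
sum(range(1_000_000))            # work being timed
elapsed = time.perf_counter() - start
print(f"--- {elapsed:.4f} seconds ---")
```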

63.22. primes in interval

#!/usr/bin/python
import sys
m = 2
n = 10
primes = [i for i in range(m,n) if all(i%j !=0 for j in range(2,int(i**0.5) + 1))]
print(primes)
[2, 3, 5, 7]

63.23. unicode characters in interval

emacs character info: C-x =

import sys
a = 945
b = 961
for i in range(a,b + 1):
    print(" ".join([str(i)," ",chr(i)]))
945   α
946   β
947   γ
948   δ
949   ε
950   ζ
951   η
952   θ
953   ι
954   κ
955   λ
956   μ
957   ν
958   ξ
959   ο
960   π
961   ρ

64. Flask

  • Flask and Quart are built on Werkzeug and use Jinja for templating.
  • Flask wraps Werkzeug, letting it handle the WSGI intricacies while adding structure and patterns for building applications.
  • Quart — an async reimplementation of Flask

Flask will never have a database layer. Flask itself just bridges to Werkzeug to implement a proper WSGI application and to Jinja2 to handle templating. It also binds to a few common standard library packages such as logging. Everything else is up for extensions.

64.1. terms

view
view function is the code you write to respond to requests to your application
Blueprints
way to organize a group of related views and other code. Flask associates view functions with blueprints when dispatching requests and generating URLs.

64.2. components

Jinja
template engine https://jinja.palletsprojects.com/
Werkzeug
WSGI toolkit https://werkzeug.palletsprojects.com/
Click
CLI toolkit https://click.palletsprojects.com/
MarkupSafe
escapes characters so it is safe to use in HTML and XML https://markupsafe.palletsprojects.com/
ItsDangerous
safe data serialization library, store the session of a Flask application in a cookie without allowing users to tamper with the session contents. https://itsdangerous.palletsprojects.com/
importlib-metadata
loads installed-package metadata at runtime; here used for the optional dotenv module.
zipp
?

64.3. static files and debugging console

64.3.1. get URL

from flask import url_for
from flask import redirect
@app.route("/")
def hell():
    return redirect(url_for('static', filename='style.css'))

64.3.2. path and console

default:

  • in localhost:8080/console
    • >>> print(app.static_folder)
      • /home/u/static
    • >>> print(app.static_url_path)
      • /static
    • >>> print(app.template_folder)
      • templates

if we set: app = Flask(static_folder='test')

  • >>> print(app.static_folder)
  • /home/u/test
  • >>> print(app.static_url_path)
  • /test
app = Flask(__name__, template_folder='./',
            static_url_path='/static',
            static_folder='/home/u/sources/documents_recognition_service/docker/worker/code/test'
            )

64.4. start, run

ways to run:

64.4.1. start $flask run (recommended)

export FLASK_DEBUG=false
export FLASK_RUN_HOST=localhost FLASK_RUN_PORT=8080 ; flask --app main run --no-debug
export FLASK_APP=main
flask --app main run --debug

FLASK_COMMAND_OPTION - pattern for all options

  • FLASK_APP
print(app.config) # to get all configuration variables in app

64.4.2. start app.run()

app.run() or flask run

  • development web server

for production deployment use gunicorn or uWSGI

app.run()

  • host – the hostname to listen on.
  • port – the port of the web server.
  • debug – if given, enable or disable debug mode. automatically reload if code changes, and will show an interactive debugger in the browser if an error occurs during a request
  • load_dotenv – load the nearest .env and .flaskenv files to set environment variables.
  • use_reloader – should the server automatically restart the python process if modules were changed?
  • use_debugger – should the werkzeug debugging system be used?
  • use_evalex – should the exception evaluation feature be enabled?
  • extra_files – a list of files the reloader should watch additionally to the modules.
  • reloader_interval – the interval for the reloader in seconds.
  • reloader_type – the type of reloader to use.
  • threaded – should the process handle each request in a separate thread?
  • processes – if greater than 1 then handle each request in a new process up to this maximum number of concurrent processes.
  • passthrough_errors – set this to True to disable the error catching.
  • ssl_context – an SSL context for the connection.

64.5. Quart

# save this as app.py
from quart import Quart, request
from markupsafe import escape

app = Quart(__name__)

@app.get("/")
async def hello():
    name = request.args.get("name", "World")
    return f"Hello, {escape(name)}!"
# $ quart run
# * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

64.6. GET

64.6.1. variables

  • string (default) accepts any text without a slash
  • int accepts positive integers
  • float accepts positive floating point values
  • path like string but also accepts slashes
  • uuid accepts UUID strings
@app.route('/post/<int:post_id>')
def show_post(post_id):
    # show the post with the given id, the id is an integer
    return f'Post {post_id}'

@app.route('/path/<path:subpath>')
def show_subpath(subpath):
    # show the subpath after /path/
    return f'Subpath {escape(subpath)}'

64.6.2. parameters ?key=value

from flask import request
searchword = request.args.get('key', '')

64.8. gentoo dependencies

  • dev-python/asgiref - Asynchronous Server Gateway Interface - calling convention for web servers to forward requests to web applications or frameworks written in Python
  • dev-python/blinker - fast dispatching system, to subscribe to events
  • dev-python/click - creating beautiful command line interfaces
  • dev-python/gpep517 - gentoo
  • dev-python/importlib_metadata - gentoo
  • dev-python/itsdangerous - helpers to pass data to untrusted environments and to get it back safe and sound
  • dev-python/jinja - template engine for Python
  • dev-python/pallets-sphinx-themes - ? themes for documentation
  • dev-python/pypy3 - fast, compliant alternative implementation of Python (4.5 times faster than CPython)
  • dev-python/pytest - Simple powerful testing with Python - detailed assertion introspection
  • dev-python/setuptools - Easily download, build, install, upgrade, and uninstall Python packages
  • dev-python/sphinx - Python documentation generator
  • dev-python/sphinx-issues
  • dev-python/sphinx-tabs
  • dev-python/sphinxcontrib-log_cabinet
  • dev-python/werkzeug - Collection of various utilities for WSGI applications
  • dev-python/wheel - A built-package format for Python

64.9. blueprints

64.10. Hello world

import flask
from flask import Flask
from flask import json, Response, redirect, url_for
from markupsafe import escape


def create_app(test=False) -> Flask:
    app = Flask(__name__, template_folder='./', static_folder='./')
    if test:
        pass

    @app.route("/predict", methods=["POST"])
    def predict():
        data = {"success": False}

        if flask.request.method != "POST":
            json_string = json.dumps(data, ensure_ascii=False)
            return Response(json_string, content_type="application/json; charset=utf-8")

    @app.route("/<name>")
    def hello(name):
        return f"Hello, {escape(name)}!"

    @app.route('/', methods=['GET', 'POST'])
    def index():
        return redirect(url_for('transcribe'))

    return app


if __name__ == "__main__":
    app = create_app()
    app.run(debug=False)

64.11. curl

one string

application/x-www-form-urlencoded is the default:

curl -d "param1=value1&param2=value2" -X POST http://localhost:3000/data

explicit:

curl -d "param1=value1&param2=value2" -H "Content-Type: application/x-www-form-urlencoded" -X POST http://localhost:3000/dat

64.12. response object

default return:

  • string => 200 OK status code and a text/html mimetype
  • dict or list => jsonify() is called to produce a response
  • iterator or generator returning strings or bytes => streaming response
  • (response, status), (response, headers), or (response, status, headers)
    • headers : list or dictionary
  • other - assume the return is a WSGI application and convert that into a response object.

make_response:

from flask import make_response

@app.route('/')
def index():
    resp = make_response(render_template(...))
    resp.set_cookie('username', 'the username')
    return resp

64.13. request object

  • from flask import request

64.13.1. get all values

for x in dir(request):
    print(x, getattr(request, x))

64.14. Jinja templates

Flask uses the Jinja template library to render templates, which live in the template folder (see 64.3.2)

  • autoescaping: data rendered in HTML templates is escaped - characters such as < and > are replaced with safe values
  • {{ and }} - for output. a single trailing newline is stripped if present, other whitespace (spaces, tabs, newlines etc.) is returned unchanged
    • {{ name|striptags|title }} - equal to title(striptags(name))
  • {% and %} - control flow and other statements
    • {%+ if something %}yay{% endif %} or {% if something +%}yay{% endif %} - a + disables whitespace trimming on that side of the block
    • {%- if something %}yay{% endif %} - the whitespace before or after that block is removed. works for {{ }} too
  • {# … #} for comments, not included in the template output
  • # for item in seq - line statement, equivalent to {% for item in seq %}

common for {{}}

  • url_for('static', filename='style.css')

join paths:

{{ path_join('pillar', 'device1.sls') }}

common for {%%}

  • {% if True %} yay {% endif %}
  • {% raw %} {% {% {% {% endraw %}
  • {% for user in users %} {{user.a}} {% endfor %}
  • {% include 'header.html' %}

64.14.1. own filters:

# 1 way
@app.template_filter('reverse')
def reverse_filter(s):
    return s[::-1]

# 2 way
def reverse_filter(s):
    return s[::-1]
app.jinja_env.filters['reverse'] = reverse_filter

app.jinja_env.filters['path_join'] = os.path.join
# usage: {{ path | path_join('..') }}

64.15. security

  • from markupsafe import escape; return f"Hello, {escape(name)}!"

werkzeug.secure_filename()

64.16. my projects

64.16.1. testing1

from main import app
from flask.testing import FlaskClient
from flask import Response
from pathlib import Path
import json
import  logging
# -- enable app.logger.debug()
app.logger.setLevel(logging.DEBUG)

app.testing = True # propagate exceptions to here, otherwise it returns only a 500 status



client: FlaskClient
with app.test_client() as client:
    # -- get
    r: Response = client.get('/audio_captcha', follow_redirects=True)
    assert r.status_code == 200
    # the same:
    r: Response = client.get('/get', query_string={'id': '123'})
    r: Response = client.get('/get?id=123')
    # print(r.status_code)
    # -- post
    r: Response = client.post('/audio_captcha', data={
        'file': Path('/home/u2/h4/PycharmProjects/captcha_fssp/929014e341a0457f5a90a909b0a51c40.wav').open('rb')}
    )
    assert r.status_code == 200
    print(json.loads(r.data))


with app.test_request_context():
    print(url_for('index'))
    print(url_for('login'))
    print(url_for('login', next='/'))
    print(url_for('profile', username='John Doe'))

# /
# /login
# /login?next=/
# /user/John%20Doe


64.16.2. testing2

from main import app
from flask.testing import FlaskClient
from flask import Response
from pathlib import Path
app.testing = True
client: FlaskClient
import json


with app.test_client() as client:
    # r: Response = client.get('/speech_ru')
    # assert r.status_code == 200
    # print(r.status_code)

    r: Response = client.post('/speech_ru', data={
        'file': Path('/home/u2/h4/PycharmProjects/captcha_fssp/929014e341a0457f5a90a909b0a51c40.wav').open('rb')}
    )
    assert r.status_code == 200
    print(json.loads(r.data))

64.17. Flask-2.2.2 hashes

MarkupSafe==2.1.1 \
  --hash=sha256:7f91197cc9e48f989d12e4e6fbc46495c446636dfc81b9ccf50bb0ec74b91d4b

Jinja2==3.1.2 \
  --hash=sha256:31351a702a408a9e7595a8fc6150fc3f43bb6bf7e319770cbc0db9df9437e852

Werkzeug==2.2.2 \
  --hash=sha256:7ea2d48322cc7c0f8b3a215ed73eabd7b5d75d0b50e31ab006286ccff9e00b8f

click==8.1.3 \
  --hash=sha256:7682dc8afb30297001674575ea00d1814d808d6a36af415a82bd481d37ba7b8e

itsdangerous==2.1.2 \
  --hash=sha256:5dbbc68b317e5e42f327f9021763545dc3fc3bfe22e6deb96aaf1fc38874156a

importlib_metadata==5.0.0 \
  --hash=sha256:da31db32b304314d044d3c12c79bd59e307889b287ad12ff387b3500835fc2ab

zipp==3.8.1 \
  --hash=sha256:05b45f1ee8f807d0cc928485ca40a07cb491cf092ff587c0df9cb1fd154848d2
Flask==2.2.2 \
 --hash=sha256:642c450d19c4ad482f96729bd2a8f6d32554aa1e231f4f6b4e7e5264b16cca2b

64.18. flask-restful

  • flask-restful - complex API on top of Flask (sucks)
  • flask-apispec inspired by Flask-RESTful and Flask-RESTplus, but attempts to provide similar functionality with greater flexibility and less code

?? https://github.com/mgorny/flask-api

marshal_with - declare serialization transformation for response https://flask-restful.readthedocs.io/en/latest/quickstart.html

64.19. example

from flask_restful import Api, Resource, reqparse, abort, fields, marshal_with
import werkzeug.datastructures

resource_fields = {
    'task':   fields.String,
    'uri':    fields.Url('todo_ep')
}

class TodoDao(object):
    def __init__(self, todo_id, task):
        self.todo_id = todo_id
        self.task = task

        # This field will not be sent in the response
        self.status = 'active'


parser = reqparse.RequestParser()
parser.add_argument('task', type=str, help='Rate to charge for this resource')
parser.add_argument('picture', type=werkzeug.datastructures.FileStorage, required=True, location='files')


class Todo(Resource):
    @marshal_with(resource_fields)
    def get(self, todo_id):
        args = parser.parse_args()
        task = {'task': args['task']}
        file = args['picture']  # name must match the parser argument
        file.save("your_file_name.jpg")
        if something:
            abort(404, message="Todo doesn't exist")
        return TodoDao(todo_id='my_todo', task='Remember the milk')

api.add_resource(Todo, '/todos/<todo_id>')

if __name__ == '__main__':
    app.run(debug=True)

64.19.1. image


64.20. swagger

  • flask_restx - same API as flask-restful but with Swagger autogeneration

flask_restx.reqparse.RequestParser.add_argument

64.21. werkzeug

64.22. debug

  1. run(debug=True) - starts two processes (the reloader parent and the actual server)
  2. localhost:8080/console
    • >> app.url_map
    • >> print(app.static_folder)

64.23. test

from flask.testing import FlaskClient
from flask import Response

from micro_file_server.__main__ import app


def test_main():
    app.testing = True
    with app.test_client() as client:
        client: FlaskClient
        r: Response = client.get('/')
        assert r.status_code == 200

64.24. production

built-in WSGI in Flask

  • does not handle more than one request at a time by default.
  • If you leave debug mode on and an error pops up, it opens a console that allows arbitrary code to be executed on your server

production WSGI (Web Server Gateway Interface) servers:

  • Gunicorn
  • Waitress
  • mod_wsgi
  • uWSGI
  • gevent
  • eventlet
  • ASGI

links


64.25. vulnerabilities

64.26. USECASES

For the return value, Flask creates a Response:

  • Response 200 OK, with the string as response body, text/html mimetype
  • (response, status, headers) or (response, headers)

64.26.1. check file exist

from flask import Flask
from flask import render_template
import os
app = Flask(__name__)
@app.route("/")
def main():
    app.logger.debug(os.path.exists(os.path.join(app.static_folder, 'staticimage.png')))
    app.logger.debug(os.path.exists(os.path.join(app.template_folder, 'index.html')))
    return render_template('index.html')

64.26.2. call POST method

request.files = {'file': open('/home/u/a.html', 'rb')}
request.method = 'POST'
r = upload()
# ('{"id": "35f190f6aa854b6c9bb0c64e601c0eda"}', 200, {'Content-Type': 'application/json'})

64.26.3. call GET method with arguments

request.args = {'id': rid}
r = get()
app.logger.debug("r " + json.dumps(json.loads(r[0]), indent=4))

64.26.4. print headers

from flask import Flask
print(__name__)
app = Flask(__name__, template_folder='./', static_folder='./')

from flask import render_template
from flask import abort, redirect, url_for
from flask import request
from werkzeug.utils import secure_filename


@app.route("/")
def hell():
    # return render_template('a.html')
    return ''.join([f"<br> {x[0]}: {x[1]}\n" for x in request.headers])

if __name__ == "__main__":
    print("start")
    app.run(host='0.0.0.0', port=80, debug=False)

64.26.5. TLS server

generate a CSR (Certificate Signing Request), used by a CA to issue the SSL certificate

  • rm server.key ; openssl genrsa -out server.key 2048 && cp server.key server.key.org && openssl rsa -in server.key.org -out server.key
    • cp server.key server.key.org
    • openssl rsa -in server.key.org -out server.key
  • openssl req -new -key server.key -out server.csr

generate self-signed:

  • openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt

CN must be full domain address

.well-known/pki-validation/926C419392B7B26DFCECBAEB9F163A53.txt

64.27. async/await and ASGI

Flask supports async coroutines for view functions by executing the coroutine on a separate thread instead of using an event loop on the main thread as an async-first (ASGI) framework would. This is necessary for Flask to remain backwards compatible with extensions and code built before async was introduced into Python. This compromise introduces a performance cost compared with the ASGI frameworks, due to the overhead of the threads.

you can run async code within a view, for example to make multiple concurrent database queries, HTTP requests to an external API, etc. However, the number of requests your application can handle at one time will remain the same.

64.28. use HTTPS

unstable certificate:

flask run --cert=adhoc

or

app.run(ssl_context='adhoc')

stable

  1. generate: openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365
app.run(ssl_context=('cert.pem', 'key.pem'))

or

flask run --cert=cert.pem --key=key.pem

or

python micro_file_server/__main__.py --cert=.cert/cert.pem --key=.cert/key.pem

65. FastAPI

  • built-in data validation feature
  • error messages displayed in JSON format
  • asynchronous task support - asyncio
  • documentation support - automatic
  • feature-rich: HTTPS requests, OAuth, XML/JSON response, TLS encryption
  • built-in monitoring tools
  • cons: expensive, difficult to scale

implements the ASGI specification
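The ASGI spec itself is small: an application is just an async callable taking scope, receive, and send. A minimal hedged sketch, driven by hand with stub receive/send callables instead of a real server:

```python
import asyncio


# a raw ASGI application: async callable of (scope, receive, send)
async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"hello"})


# drive it by hand for illustration (a server like uvicorn would do this)
sent = []


async def receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def send(message):
    sent.append(message)


asyncio.run(app({"type": "http"}, receive, send))
print(sent[0]["status"], sent[1]["body"])
```

Frameworks like FastAPI produce such a callable for you; an ASGI server invokes it once per connection.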

66. Databases

66.1. Groonga

http://groonga.org/docs/ GNU Lesser General Public License v2.1

  • full text search engine based on inverted index
  • updates without read locks
  • column-oriented database management system
  • read lock-free
  • Geo-location (latitude and longitude) search

start:

  • apt-get install groonga
  • $ groonga -n grb.db - create database
  • $ groonga -s -p 10041 grb.db

0.0.0.0:10041

66.1.1. Basic commands:

status
shows status of a Groonga process.
table_list
shows a list of tables in a database.
column_list
shows a list of columns in a table.
table_create
adds a table to a database.
column_create
adds a column to a table.
select
searches records from a table and shows the result.
load
inserts records to a table.
table_create --name Site --flags TABLE_HASH_KEY --key_type ShortText
select --table Site
column_create --table Site --name gender --type UInt8
select Site --filter 'fuzzy_search(_key, "two")'

https://github.com/groonga/groonga/search?l=C&q=fuzzy_search

default:

  • data.max_distance = 1;
  • data.prefix_length = 0;
  • data.prefix_match_size = 0;
  • data.max_expansion = 0;

66.1.2. python

https://github.com/hhatto/poyonga

pip install --upgrade poyonga
groonga -s --protocol http grb.db
from poyonga import Groonga
g = Groonga(port=10041, protocol="http", host='0.0.0.0')
print(g.call("status").status)
# >>> 0
  1. load
    from poyonga import Groonga
    
    def _call(g, cmd, **kwargs):
        ret = g.call(cmd, **kwargs)
        print(ret.status)
        print(ret.body)
        if cmd == 'select':
            for item in ret.items:
                print(item)
            print("=*=" * 30)
    
    data = """\
    [
      {
        "_key": "one",
        "gender": 1
      }
    ]
    """
    _call(g, "load", table="Site", values="".join(data.splitlines()))
    
    

66.2. Oracle

https://www.oracle.com/database/technologies/instant-client.html

python cx_Oracle

require: Oracle Instant Client - Basic zip, SQLPlus zip (for console)

.bashrc

export LD_LIBRARY_PATH=/home/u2/.local/instantclient_19_8:$LD_LIBRARY_PATH
wget https://download.oracle.com/otn_software/linux/instantclient/instantclient-basic-linuxx64.zip
unzip instantclient-basic-linuxx64.zip
apt-get install libaio1
export LD_LIBRARY_PATH=/instantclient_19_8:$LD_LIBRARY_PATH

66.2.1. sql

SELECT *
FROM
    nls_database_parameters
WHERE
    PARAMETER = 'NLS_NCHAR_CHARACTERSET';

DELETE FROM table - remove records
drop table - remove table

SELECT * FROM ALL_OBJECTS - system
SELECT * FROM v$version - oracle version

66.3. MySQL

67. Virtualenv

enables multiple side-by-side installations of Python, one for each project.

67.1. venv - default module

Creation of virtual environments is done by executing the command venv:

  1. python3 -m venv path
  2. source <venv>/bin/activate

67.2. virtualenv

  • pip3.6 install virtualenv --user
  • ~/.local/bin/virtualenv ENV
  • source ENV/bin/activate

68. ldap

apt-get install libsasl2-dev python-dev libldap2-dev libssl-dev

69. Containerized development

Docker

  • ENV values are available to containers
import os

USER = os.getenv('API_USER')
PASSWORD = os.environ.get('API_PASSWORD')
os.environ['API_USER'] = 'username'
os.environ['API_PASSWORD'] = 'secret'

70. security

  • html.escape - <html> => &lt;html&gt;
  • from werkzeug.utils import secure_filename - request.files['the_file'].filename
  • 32.9 - 64.17
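The escaping in the first bullet can be checked directly with the standard library:

```python
import html

# escape HTML-special characters so user input cannot inject markup
escaped = html.escape("<html>")
print(escaped)  # &lt;html&gt;
```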

71. serialization

  • pickle (unsafe alone) + hmac
  • json
  • YAML: YAML is a superset of JSON, but easier to read (read & write, comparison of JSON and YAML)
  • csv
  • MessagePack (Python package): More compact representation (read & write)
  • HDF5 (Python package): Nice for matrices (read & write)
  • XML: exists too sigh (read & write)
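A minimal sketch of the "pickle + hmac" idea from the list above: sign the pickled bytes with a secret key and verify the signature before unpickling. The key and payload here are placeholders.

```python
import hashlib
import hmac
import pickle

SECRET = b"replace-with-a-real-secret"  # placeholder key


def dumps_signed(obj) -> bytes:
    payload = pickle.dumps(obj)
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return sig + payload  # 32-byte signature, then the pickle


def loads_signed(blob: bytes):
    sig, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature - refusing to unpickle")
    return pickle.loads(payload)


blob = dumps_signed({"a": 1})
print(loads_signed(blob))  # {'a': 1}
```

This only protects against tampering with blobs you produced yourself; never unpickle data signed with a key the attacker may know.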

71.1. pickle

# -- pandas pickle and csv --
import pandas as pd

p: str = 'data.csv'  # illustrative path to a .csv or .pickle file
nrows = None         # or an int, to read only the first rows of a csv
if p.endswith('.csv'):
    df = pd.read_csv(p, index_col=0, low_memory=False, nrows=nrows)
elif p.endswith('.pickle'):
    df: pd.DataFrame = pd.read_pickle(p)

# -- pickle
import pickle
with open('filename.pickle', 'wb') as fh:
    pickle.dump(a, fh, protocol=pickle.HIGHEST_PROTOCOL)
with open('filename.pickle', 'rb') as fh:
    b = pickle.load(fh)

72. cython

  • cython -3 --embed a.py
  • gcc `python3-config --cflags --ldflags` -lpython3.10 -fPIC -shared a.c

from doc:

gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
      -I/usr/include/python3.5 -o yourmod.so yourmod.c

73. headless browsers

74. selenium

  • Selenium WebDriver - interface to write instructions that work interchangeably across browsers, like a headless browser.
    • 1) Protocol specification
    • 2) Ruby official implementation for Protocol specification
    • 3) ChromeDriver, GeckoDriver - implementations of specification by Google and Mozilla. Most drivers are created by the browser vendors themselves
  • Selenium Remote Control (RC) (pip install selenium) - simple(?) interface to browsers and to WebDriver
  • Selenium IDE - browser plug-in, records your actions in the browser and repeats them.
  • Selenium Grid - allows you to run parallel tests on multiple machines and browsers at the same time
  • bindings for languages.

pros:

  • easily integrates with various development platforms such as Jenkins, Maven, TestNG, QMetry, SauceLabs, etc.

cons:

  • No built-in image comparison ( Sikuli is a common choice)
  • No tech support
  • No reporting capabilities
    • TestNG creates two types of reports upon test execution: detailed and summary. The summary provides simple passed/failed data; while detailed reports have logs, errors, test groups, etc.
    • JUnit uses HTML to generate simple reports in Selenium with indicators “failed” and “succeeded.”
    • Extent Library is the most complex option: It creates test summaries, includes screenshots, generates pie charts, and so on.
    • Allure creates beautiful reports with graphs, a timeline, and categorized test results — all on a handy dashboard.
  • well-coded Selenium test typically verifies less than 10% of the user interface

Tools for testing web/mobile apps, based on Selenium:

  • Selendroid focused exclusively on Android
  • Appium - iOS, Android, and Windows devices
  • Robotium — a black-box testing framework for Android
  • ios-driver—a Selenium WebDriver API for iOS testing integrated with Selenium Grid

74.3. python installation

74.4. python usage

from selenium import webdriver

search_string = "selenium"  # illustrative query

driver = webdriver.Firefox()
driver.get("https://google.com")
for i in range(1):
    driver.get("https://www.google.com/search?q=" +
               search_string + "&start=" + str(i))

# driver.find_element_by_id("nav-search").send_keys("Selenium")

75. plot in terminal

75.1. plotext

https://github.com/piccolomo/plotext Load on workers 0 and 1 is 400 and 500:

pip install plotext
python3 -c "import plotext as plt; plt.bar([0,1],[400,500]) ; plt.show() ;"

76. xml parsing

import xml.etree.ElementTree as ET
xmlfile = "a.xml"
tree = ET.parse(xmlfile)
root = tree.getroot()
for child in root:
    print(child.tag, [x.tag for x in child], child.attrib)
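The same API also parses XML from a string, handy when there is no file on disk; the sample document below is made up:

```python
import xml.etree.ElementTree as ET

doc = "<root><a x='1'><b/></a><c/></root>"  # illustrative sample
root = ET.fromstring(doc)
# collect (tag, child tags, attributes) for each top-level element
tags = [(child.tag, [x.tag for x in child], child.attrib) for child in root]
print(tags)  # [('a', ['b'], {'x': '1'}), ('c', [], {})]
```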

77. pytest

77.1. features

[pytest] # pytest.ini (or .pytest.ini), pyproject.toml, tox.ini, or setup.cfg
testpaths = testing doc # as if $pytest testing doc

pytest -x          # stop after first failure
pytest --maxfail=2 # stop after two failures

77.2. layout

pyproject.toml
src/
    mypkg/
        __init__.py
        app.py
        view.py
tests/
    test_app.py
    test_view.py
    ...

77.3. usage

  1. cd project (with pyproject.toml and the tests folder)
  2. pytest [ folders … ] - packages should be added to PYTHONPATH manually
  3. or python -m pytest (this one adds the current directory to sys.path) - current directory must be src or the package (for flat layout)

77.4. dependencies

dev-python/pytest-7.3.2:
 [  0]  dev-python/pytest-7.3.2
 [  1]  dev-python/iniconfig-2.0.0
 [  1]  dev-python/more-itertools-9.1.0
 [  1]  dev-python/packaging-23.1
 [  1]  dev-python/pluggy-1.0.0-r2
 [  1]  dev-python/exceptiongroup-1.1.1
 [  1]  dev-python/tomli-2.0.1-r1
 [  1]  dev-python/pypy3-7.3.11_p1
 [  1]  dev-lang/python-3.10.11
 [  1]  dev-lang/python-3.11.3
 [  1]  dev-lang/python-3.12.0_beta2
 [  1]  dev-python/setuptools-scm-7.1.0
 [  1]  dev-python/argcomplete-3.0.8
 [  1]  dev-python/attrs-23.1.0
 [  1]  dev-python/hypothesis-6.76.0
 [  1]  dev-python/mock-5.0.2
 [  1]  dev-python/pygments-2.15.1
 [  1]  dev-python/pytest-xdist-3.3.1
 [  1]  dev-python/requests-2.31.0
 [  1]  dev-python/xmlschema-2.3.0
 [  1]  dev-python/gpep517-13
 [  1]  dev-python/setuptools-67.7.2
 [  1]  dev-python/wheel-0.40.0

77.5. fixtures - context for the test

fixtures can use other fixtures

import pytest

class Fruit:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return self.name == other.name


@pytest.fixture
def my_fruit():
    return Fruit("apple")


@pytest.fixture
def fruit_basket(my_fruit):
    return [Fruit("banana"), my_fruit]


def test_my_fruit_in_basket(my_fruit, fruit_basket):
    assert my_fruit in fruit_basket

https://docs.pytest.org/en/latest/explanation/fixtures.html#what-fixtures-are

77.6. print

pytest captures stdout and stderr by default and shows the captured output only for failing tests

pytest -s                  # disable all capturing

77.7. troubleshooting

ModuleNotFoundError: No module named 'micro_file_server'

  • solution 1: pyproject.toml:
[tool.pytest.ini_options]
pythonpath = [ "." ]

78. static analysis tools:

static type checkers - mypy, Pyre

https://github.com/analysis-tools-dev/static-analysis#python

78.1. security

Common Vulnerabilities and Exposures (CVE)

  • CVEs - We can count them and fix them
  • SCA - composition analysis tools.
    • Mostly signature based
    • 3rd party and our own
  • vulnerabilities

Things that probably won’t hurt us

  • Good habits/code hygiene
  • Active development
  • Developers we trust
  • CVE and SCA clear

78.2. mypy

reveal_type() - shows the type mypy infers for an expression anywhere in your program.

78.2.1. emacs fix

mypy /dev/stdin

78.2.2. ex

import random
from typing import Sequence, TypeVar

Choosable = TypeVar("Choosable", str, float)

def choose(items: Sequence[Choosable]) -> Choosable:
    return random.choice(items)

reveal_type(choose(["Guido", "Jukka", "Ivan"]))
reveal_type(choose([1, 2, 3]))
reveal_type(choose([True, 42, 3.14]))
reveal_type(choose(["Python", 3, 7]))
/dev/stdin:14: note: Revealed type is "builtins.str"
/dev/stdin:16: note: Revealed type is "builtins.float"
/dev/stdin:18: note: Revealed type is "builtins.float"
/dev/stdin:20: error: Value of type variable "Choosable" of "choose" cannot be "object"  [type-var]
/dev/stdin:20: note: Revealed type is "builtins.object"
Found 1 error in 1 file (checked 1 source file)

79. release as executable - Pyinstaller

80. troubleshooting

def a(l: list = []):

  1. If the caller passes an empty list, a guard like l = l or [] will not use that list but will create a new one, because an empty list is "falsy".
  2. The default empty list is created just once, when the function is defined, not every time the function is called.
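Pitfall 2 can be demonstrated directly; the function names are illustrative:

```python
def append_item(item, l: list = []):
    # the default [] is created once, at definition time, and shared
    l.append(item)
    return l


first = append_item(1)
second = append_item(2)  # reuses the same list!
print(first, second)     # [1, 2] [1, 2]


def append_item_fixed(item, l=None):
    # sentinel pattern: make a fresh list per call
    if l is None:
        l = []
    l.append(item)
    return l


print(append_item_fixed(1), append_item_fixed(2))  # [1] [2]
```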

python tests/test_main.py - ModuleNotFoundError: No module named

  • solution: PYTHONPATH=. python tests/test_main.py

Created: 2024-03-03 Sun 09:57
